%%html
<style>
.qst {padding:10px; border-radius: 5px; border: solid 2px #EF6E1A;}
.qst:before {font-weight: bold;display: block; margin: 0px 10px 10px 10px;}
h1,h3 {color: #EF6E1A;}
</style>
The objective is to develop a predictive model that effectively differentiates between good and risky customers for Bankaya, thereby supporting more accurate credit approval decisions. The model focuses specifically on new customers applying for their first loan to purchase a smartphone. Each step in the development process is justified and documented to ensure both efficacy and transparency.
You can find the repository on GitHub at the following link: https://github.com/nahim21/Bankaya_risk_model
#!pip install imbalanced-learn
#!pip install mlxtend
# -- Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo
import json
import sys
pyo.init_notebook_mode(connected=True)
#pio.renderers.default = "notebook"
pd.set_option('display.max_columns', None)
# ML functions
from ML_functions.plots import plot_histogram
from ML_functions.util_functions import plot_diagnostic_outlier,select_diagnostic_outliers
from ML_functions.plots import plot_pie, plot_time_series, matrix_correlation
# --- Model functions
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import category_encoders as ce
import warnings
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
# Performance Metrics
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.metrics import confusion_matrix,classification_report
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
# ======== MODELLING LIBRARIES =======
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
warnings.filterwarnings("ignore")
# --- Load the datasets
customer_loan_df = pd.read_parquet("input/main_dataset.parquet") #---> main dataset
credit_report_df = pd.read_parquet("input/credit_reports.parquet") # ---> credit reports
# --- Examine the first rows of the customer dataset
print(f"The shape of the dataset is: {customer_loan_df.shape}")
customer_loan_df.head(7)
The shape of the dataset is: (14454, 17)
| customer_id | loan_id | ACC_CREATION_DATETIME | APPLICATION_DATETIME | LOAN_ORIGINATION_DATETIME | max_days_late | target | account_to_application_days | n_sf_apps | first_app_date | last_app_date | n_bnpl_apps | n_bnpl_approved_apps | first_bnpl_app_date | last_bnpl_app_date | n_inquiries_l3m | n_inquiries_l6m | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1223 | 1 | 2021-08-23 13:57:56 | 2022-04-26 07:00:00 | 2022-10-20 18:26:35 | 5 | 0 | 245 | 1.0 | 2022-04-25 | 2022-04-25 | 1.0 | 0.0 | 2022-04-26 02:15:17.742 | 2022-04-26 02:15:17.742 | 91.0 | 91.0 |
| 1 | 5190 | 2 | 2022-04-26 09:57:25 | 2022-04-26 07:00:00 | 2022-12-22 18:03:32 | 2 | 0 | 0 | NaN | NaT | NaT | NaN | NaN | NaT | NaT | NaN | NaN |
| 2 | 5194 | 3 | 2022-04-26 12:22:35 | 2022-04-26 07:00:00 | 2022-08-09 17:31:05 | 4 | 0 | 0 | NaN | NaT | NaT | NaN | NaN | NaT | NaT | NaN | NaN |
| 3 | 3978 | 4 | 2022-03-09 11:26:55 | 2022-04-26 07:00:00 | 2022-11-11 19:48:08 | 2 | 0 | 47 | 1.0 | 2022-03-17 | 2022-03-17 | 2.0 | 0.0 | 2022-03-19 22:02:50.555 | 2022-03-09 16:41:49.675 | 13.0 | 13.0 |
| 4 | 4535 | 5 | 2022-04-01 14:28:42 | 2022-04-26 07:00:00 | 2022-07-24 12:07:15 | 2 | 0 | 24 | 1.0 | 2022-04-01 | 2022-04-01 | 1.0 | 0.0 | 2022-04-01 19:37:56.185 | 2022-04-01 19:37:56.185 | 21.0 | 21.0 |
| 5 | 3604 | 6 | 2022-02-21 11:55:32 | 2022-05-05 07:00:00 | 2022-12-08 07:12:18 | 7 | 0 | 72 | 1.0 | 2022-02-21 | 2022-02-21 | NaN | NaN | NaT | NaT | 26.0 | 26.0 |
| 6 | 271 | 7 | 2021-04-06 16:30:21 | 2022-05-05 07:00:00 | 2023-04-21 15:40:10 | 0 | 0 | 393 | NaN | NaT | NaT | 1.0 | 0.0 | 2022-05-04 23:18:03.604 | 2022-05-04 23:18:03.604 | 11.0 | 11.0 |
# --- Examine the first rows of the credit report dataset
print(f"The shape of the dataset is: {credit_report_df.shape}")
credit_report_df.head(7)
The shape of the dataset is: (287356, 29)
| customer_id | INQUIRY_TIME | CDC_INQUIRY_ID | INQUIRY_DATE | PREVENTION_KEY | CURRENCY | MAX_CREDIT | CREDIT_LIMIT | PAYMENT_AMOUNT | UPDATE_DATE | LOAN_OPENING_DATE | LOAN_CLOSING_DATE | WORST_DELAY_DATE | REPORT_DATE | LAST_PURCHASE_DATE | LAST_PAYMENT_DATE | PAYMENT_FREQUENCY | BUSINESS_TYPE | CREDIT_TYPE | ACCOUNT_TYPE | RESPONSABILITY_TYPE | TOTAL_PAYMENTS | DELAYED_PAYMENTS | CURRENT_PAYMENT | WORST_DELAY | TOTAL_REPORTED_PAYMENTS | CURRENT_BALANCE | BALANCE_DUE | BALANCE_DUE_WORST_DELAY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4223 | 2022-04-01 00:32:36.689000+00:00 | 710278-27993a6e-2885-48d4-a753-9249e7022af1 | 2022-04-01 | SIN OBSERVACION | PESOS MEXICANOS | 1244.0 | 9900.0 | 1244.0 | 2022-02-28 | 2019-08-30 | NaT | 2022-01-31 | 2022-02-28 | 2019-08-30 | 2020-08-18 | MENSUAL | TIENDA DEPARTAMENTAL | APARATOS/MUEBLES | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 12.0 | 7 | 13 | 13.0 | 0 | 1244.0 | 1244.0 | 1244.0 |
| 1 | 4223 | 2022-04-01 00:32:36.689000+00:00 | 710278-27993a6e-2885-48d4-a753-9249e7022af1 | 2022-04-01 | SIN OBSERVACION | PESOS MEXICANOS | 1722.0 | 9900.0 | 1722.0 | 2022-02-28 | 2020-04-21 | NaT | 2021-12-31 | 2022-02-28 | 2020-04-21 | 2020-08-18 | PAGO MINIMO PARA CUENTAS REVOLVENTES | TIENDA DEPARTAMENTAL | LINEA DE CREDITO | REVOLVENTE | INDIVIDUAL (TITULAR) | 0.0 | 3 | 13 | 13.0 | 0 | 1722.0 | 1722.0 | 1722.0 |
| 2 | 4223 | 2022-04-01 00:32:36.689000+00:00 | 710278-27993a6e-2885-48d4-a753-9249e7022af1 | 2022-04-01 | SIN OBSERVACION | PESOS MEXICANOS | 6346.0 | 0.0 | 482.0 | 2022-03-28 | 2021-12-17 | NaT | NaT | 2022-03-28 | 2021-12-17 | 2022-03-26 | SEMANAL | BANCOS | PRESTAMO PERSONAL | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 16.0 | 0 | V | 0.0 | 0 | 943.0 | 0.0 | 0.0 |
| 3 | 3490 | 2022-02-15 02:30:22.086000+00:00 | 622857-6b4e9d95-7491-40c3-bccd-442de7f94c58 | 2022-02-15 | SIN OBSERVACION | PESOS MEXICANOS | 11600.0 | 0.0 | 232.0 | 2022-02-01 | 2021-03-02 | NaT | 2022-01-04 | 2022-02-01 | 2021-03-02 | 2022-01-25 | SEMANAL | TIENDA DE ROPA | PRESTAMO PERSONAL | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 100.0 | 4 | 01 | 1.0 | 0 | 6185.0 | 116.0 | 116.0 |
| 4 | 6486 | 2022-06-25 01:57:14.868000+00:00 | 875073-46a5f149-19db-4193-b92a-b8f41fb3e896 | 2022-06-25 | SIN OBSERVACION | PESOS MEXICANOS | 2452.0 | 16800.0 | 2452.0 | 2021-09-30 | 2019-02-22 | NaT | 2020-08-31 | 2021-09-30 | 2019-04-16 | 2019-06-12 | PAGO MINIMO PARA CUENTAS REVOLVENTES | TIENDA DEPARTAMENTAL | LINEA DE CREDITO | REVOLVENTE | INDIVIDUAL (TITULAR) | 0.0 | 19 | 13 | 13.0 | 0 | 2452.0 | 2452.0 | 2452.0 |
| 5 | 6486 | 2022-06-25 01:57:14.868000+00:00 | 875073-46a5f149-19db-4193-b92a-b8f41fb3e896 | 2022-06-25 | SIN OBSERVACION | PESOS MEXICANOS | 0.0 | 0.0 | 0.0 | 2022-05-31 | 2021-11-30 | NaT | NaT | 2022-05-31 | 2022-05-16 | 2022-05-16 | PAGO MINIMO PARA CUENTAS REVOLVENTES | SERVICIO DE TELEVISION DE PAGA | OTROS | SIN LIMITE PREESTABLECIDO | INDIVIDUAL (TITULAR) | 0.0 | 0 | V | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 6 | 4075 | 2022-03-13 01:46:41.593000+00:00 | 693356-d32e8877-2ba0-48c0-8366-08d733aa974e | 2022-03-13 | SIN OBSERVACION | PESOS MEXICANOS | 7783.0 | 0.0 | 0.0 | 2004-09-23 | 2003-08-05 | 2004-04-27 | NaT | 2004-09-23 | 2003-08-05 | 2004-04-27 | SEMANAL | MERCANCIA PARA HOGAR Y OFICINA | CREDITO AL CONSUMO | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 39.0 | 0 | V | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
# --- Examine the basic statistics of the customer_loan_df
customer_loan_df.describe().T
| count | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|
| customer_id | 14454.0 | 7227.5 | 1.0 | 3614.25 | 7227.5 | 10840.75 | 14454.0 | 4172.654731 |
| loan_id | 14454.0 | 7227.5 | 1.0 | 3614.25 | 7227.5 | 10840.75 | 14454.0 | 4172.654731 |
| ACC_CREATION_DATETIME | 14454 | 2022-06-17 07:24:49.443337216 | 2020-10-14 18:22:10 | 2022-02-21 18:46:22.249999872 | 2022-07-19 20:29:43.500000 | 2022-11-13 07:37:39.249999872 | 2023-05-19 19:55:04 | NaN |
| APPLICATION_DATETIME | 14454 | 2022-11-28 03:42:40.896637696 | 2022-04-26 07:00:00 | 2022-09-15 13:00:00 | 2022-12-20 08:00:00 | 2023-02-04 08:00:00 | 2023-05-26 07:00:00 | NaN |
| LOAN_ORIGINATION_DATETIME | 14454 | 2022-12-28 06:04:09.504220416 | 2022-07-01 09:03:20 | 2022-10-27 21:15:58.249999872 | 2023-01-11 10:05:49.500000 | 2023-03-06 18:07:46.249999872 | 2023-05-29 12:18:28 | NaN |
| max_days_late | 14454.0 | 14.225889 | -7.0 | 0.0 | 2.0 | 20.0 | 70.0 | 21.738445 |
| target | 14454.0 | 0.1868 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.389764 |
| account_to_application_days | 14454.0 | 163.489 | 0.0 | 0.0 | 103.0 | 271.75 | 901.0 | 181.110989 |
| n_sf_apps | 6806.0 | 1.653982 | 1.0 | 1.0 | 1.0 | 2.0 | 42.0 | 1.697131 |
| first_app_date | 6806 | 2022-06-16 04:31:39.970614272 | 2021-04-27 00:00:00 | 2022-02-27 00:00:00 | 2022-07-15 00:00:00 | 2022-10-21 00:00:00 | 2023-05-12 00:00:00 | NaN |
| last_app_date | 6806 | 2022-06-16 01:42:11.531002112 | 2021-04-25 00:00:00 | 2022-02-25 00:00:00 | 2022-07-16 00:00:00 | 2022-10-22 00:00:00 | 2023-05-12 00:00:00 | NaN |
| n_bnpl_apps | 8739.0 | 1.221765 | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.831144 |
| n_bnpl_approved_apps | 8739.0 | 0.264904 | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 | 0.602481 |
| first_bnpl_app_date | 8739 | 2022-08-13 09:30:46.852090880 | 2022-01-06 21:17:08.193000 | 2022-05-01 21:03:56.963500032 | 2022-08-18 13:36:14.271000064 | 2022-11-06 19:24:55.189499904 | 2023-05-20 17:15:47 | NaN |
| last_bnpl_app_date | 8739 | 2022-08-03 09:11:07.589265408 | 2022-01-06 21:17:08.193000 | 2022-04-20 05:33:33.585999872 | 2022-07-28 17:37:41.676999936 | 2022-11-06 01:50:47.641999872 | 2023-05-17 15:20:48 | NaN |
| n_inquiries_l3m | 9083.0 | 10.350435 | 0.0 | 0.0 | 0.0 | 14.0 | 170.0 | 19.694595 |
| n_inquiries_l6m | 9083.0 | 17.11483 | 0.0 | 0.0 | 8.0 | 26.0 | 213.0 | 23.229088 |
Loan information
Loan Creation and Application Dates
BNPL (Buy Now Pay Later) Applications:
SF (Smartphone Finance) Applications:
Credit Bureau Inquiries
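The target mean of ~0.187 above shows the classes are imbalanced: only about 19% of loans are labeled risky. Besides the SMOTE oversampling imported earlier, class weighting is one way to account for this. A minimal sketch on a toy target with the same positive rate, using scikit-learn's `compute_class_weight` (the remedy actually applied later is a modeling choice, not fixed here):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target mimicking the ~18.7% positive rate observed in `target`
y = np.array([1] * 187 + [0] * 813)

# "balanced" weights each class by n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority (risky) class gets the larger weight
```

These weights can be passed to most classifiers via `class_weight`, as an alternative or complement to SMOTE.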
# --- Examine the basic statistics of the credit_report_df
credit_report_df.describe().T
| count | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|
| customer_id | 287356.0 | 6333.768075 | 1.0 | 2776.0 | 6187.0 | 9760.0 | 14416.0 | 3989.039684 |
| INQUIRY_DATE | 287267 | 2022-07-02 00:44:25.383771392 | 2021-04-29 00:00:00 | 2022-03-22 00:00:00 | 2022-07-18 00:00:00 | 2022-10-27 00:00:00 | 2023-05-17 00:00:00 | NaN |
| MAX_CREDIT | 287174.0 | 15280.383928 | 0.0 | 548.0 | 2300.0 | 7483.0 | 404040416.0 | 1313088.110101 |
| CREDIT_LIMIT | 278999.0 | 5525.841766 | 0.0 | 0.0 | 400.0 | 3000.0 | 1900000.0 | 27882.692051 |
| PAYMENT_AMOUNT | 287267.0 | 1670.948417 | 0.0 | 0.0 | 0.0 | 149.0 | 1800000.0 | 13010.321018 |
| UPDATE_DATE | 287267 | 2020-03-26 04:35:03.002433024 | 2001-04-25 00:00:00 | 2019-03-13 00:00:00 | 2021-11-22 00:00:00 | 2022-06-15 00:00:00 | 2023-05-07 00:00:00 | NaN |
| LOAN_OPENING_DATE | 287267 | 2018-11-22 15:24:26.870889984 | 1950-01-01 00:00:00 | 2017-03-21 00:00:00 | 2020-04-02 00:00:00 | 2021-10-14 00:00:00 | 2023-03-31 00:00:00 | NaN |
| LOAN_CLOSING_DATE | 192609 | 2018-10-24 21:45:09.910752 | 2000-08-28 00:00:00 | 2017-01-30 00:00:00 | 2020-01-27 00:00:00 | 2021-09-20 00:00:00 | 2023-03-31 00:00:00 | NaN |
| WORST_DELAY_DATE | 84658 | 2019-08-11 02:14:42.969123072 | 1998-10-10 00:00:00 | 2018-04-30 00:00:00 | 2021-05-31 00:00:00 | 2022-04-20 00:00:00 | 2023-04-12 00:00:00 | NaN |
| REPORT_DATE | 287267 | 2020-03-22 05:43:32.962853376 | 2001-04-25 00:00:00 | 2019-02-28 00:00:00 | 2021-11-10 00:00:00 | 2022-06-12 00:00:00 | 2023-05-07 00:00:00 | NaN |
| LAST_PURCHASE_DATE | 283214 | 2019-02-17 08:26:14.566228992 | 1950-01-01 00:00:00 | 2017-07-29 00:00:00 | 2020-08-03 00:00:00 | 2021-11-16 00:00:00 | 2023-04-07 00:00:00 | NaN |
| LAST_PAYMENT_DATE | 255679 | 2019-04-09 00:51:00.575174400 | 1950-01-01 00:00:00 | 2017-09-02 00:00:00 | 2020-09-06 00:00:00 | 2021-12-27 00:00:00 | 2023-05-07 00:00:00 | NaN |
| TOTAL_PAYMENTS | 268711.0 | 22.773876 | 0.0 | 1.0 | 3.0 | 16.0 | 1800.0 | 89.005749 |
| WORST_DELAY | 284146.0 | 4.27879 | 0.0 | 0.0 | 0.0 | 2.0 | 84.0 | 12.231546 |
| CURRENT_BALANCE | 287267.0 | 4578.482927 | 0.0 | 0.0 | 0.0 | 273.0 | 3469743.0 | 35543.251719 |
| BALANCE_DUE | 287267.0 | 2090.320639 | 0.0 | 0.0 | 0.0 | 0.0 | 1800000.0 | 14111.862995 |
| BALANCE_DUE_WORST_DELAY | 287267.0 | 1671.461922 | 0.0 | 0.0 | 0.0 | 159.0 | 1800000.0 | 11646.107604 |
Credit Limits & Usage
Payments
Delays
Potential Data Quality Issues:
Given that we're interested in a comprehensive view of each customer's credit behavior, an inner join is the most straightforward and effective choice. It ensures that we only consider customers for whom we have complete information from both internal and external sources.
# --- Merge the datasets using the customer_id column
bankaya_merged = pd.merge(customer_loan_df,credit_report_df, on='customer_id',how='inner')
# --- Examine the first rows of the new dataset
print(f"The shape of the merged dataset is: {bankaya_merged.shape}")
print(f'previous customers amount: {len(customer_loan_df.customer_id.unique())}')
print(f'current customers amount: {len(bankaya_merged.customer_id.unique())}')
bankaya_merged.head(5)
The shape of the merged dataset is: (287356, 45) previous customers amount: 14454 current customers amount: 9249
| customer_id | loan_id | ACC_CREATION_DATETIME | APPLICATION_DATETIME | LOAN_ORIGINATION_DATETIME | max_days_late | target | account_to_application_days | n_sf_apps | first_app_date | last_app_date | n_bnpl_apps | n_bnpl_approved_apps | first_bnpl_app_date | last_bnpl_app_date | n_inquiries_l3m | n_inquiries_l6m | INQUIRY_TIME | CDC_INQUIRY_ID | INQUIRY_DATE | PREVENTION_KEY | CURRENCY | MAX_CREDIT | CREDIT_LIMIT | PAYMENT_AMOUNT | UPDATE_DATE | LOAN_OPENING_DATE | LOAN_CLOSING_DATE | WORST_DELAY_DATE | REPORT_DATE | LAST_PURCHASE_DATE | LAST_PAYMENT_DATE | PAYMENT_FREQUENCY | BUSINESS_TYPE | CREDIT_TYPE | ACCOUNT_TYPE | RESPONSABILITY_TYPE | TOTAL_PAYMENTS | DELAYED_PAYMENTS | CURRENT_PAYMENT | WORST_DELAY | TOTAL_REPORTED_PAYMENTS | CURRENT_BALANCE | BALANCE_DUE | BALANCE_DUE_WORST_DELAY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1223 | 1 | 2021-08-23 13:57:56 | 2022-04-26 07:00:00 | 2022-10-20 18:26:35 | 5 | 0 | 245 | 1.0 | 2022-04-25 | 2022-04-25 | 1.0 | 0.0 | 2022-04-26 02:15:17.742 | 2022-04-26 02:15:17.742 | 91.0 | 91.0 | 2022-04-25 21:58:16.422000+00:00 | 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a | 2022-04-25 | SIN OBSERVACION | PESOS MEXICANOS | 3000.0 | 0.0 | 0.0 | 2017-04-19 | 2013-03-27 | 2017-02-21 | NaT | 2017-02-28 | 2017-02-21 | 2013-03-27 | MENSUAL | SERVICIOS | LINEA DE CREDITO | SIN LIMITE PREESTABLECIDO | INDIVIDUAL (TITULAR) | NaN | 0 | V | NaN | None | 0.0 | 0.0 | 0.0 |
| 1 | 1223 | 1 | 2021-08-23 13:57:56 | 2022-04-26 07:00:00 | 2022-10-20 18:26:35 | 5 | 0 | 245 | 1.0 | 2022-04-25 | 2022-04-25 | 1.0 | 0.0 | 2022-04-26 02:15:17.742 | 2022-04-26 02:15:17.742 | 91.0 | 91.0 | 2022-04-25 21:58:16.422000+00:00 | 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a | 2022-04-25 | SIN OBSERVACION | PESOS MEXICANOS | 500.0 | 2200.0 | 0.0 | 2021-06-21 | 2021-04-03 | 2021-04-18 | NaT | 2021-06-21 | 2021-04-03 | 2021-04-18 | MENSUAL | COMPANIA DE PRESTAMO PERSONAL | PRESTAMO PERSONAL | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 1.0 | 0 | V | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 2 | 1223 | 1 | 2021-08-23 13:57:56 | 2022-04-26 07:00:00 | 2022-10-20 18:26:35 | 5 | 0 | 245 | 1.0 | 2022-04-25 | 2022-04-25 | 1.0 | 0.0 | 2022-04-26 02:15:17.742 | 2022-04-26 02:15:17.742 | 91.0 | 91.0 | 2022-04-25 21:58:16.422000+00:00 | 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a | 2022-04-25 | SIN OBSERVACION | PESOS MEXICANOS | 1100.0 | 2200.0 | 0.0 | 2021-06-21 | 2021-04-18 | 2021-04-22 | NaT | 2021-06-21 | 2021-04-18 | 2021-04-22 | MENSUAL | COMPANIA DE PRESTAMO PERSONAL | PRESTAMO PERSONAL | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 1.0 | 0 | V | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | 1223 | 1 | 2021-08-23 13:57:56 | 2022-04-26 07:00:00 | 2022-10-20 18:26:35 | 5 | 0 | 245 | 1.0 | 2022-04-25 | 2022-04-25 | 1.0 | 0.0 | 2022-04-26 02:15:17.742 | 2022-04-26 02:15:17.742 | 91.0 | 91.0 | 2022-04-25 21:58:16.422000+00:00 | 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a | 2022-04-25 | SIN OBSERVACION | PESOS MEXICANOS | 1700.0 | 2500.0 | 0.0 | 2021-07-13 | 2021-04-22 | 2021-05-03 | NaT | 2021-07-13 | 2021-04-22 | 2021-05-03 | MENSUAL | COMPANIA DE PRESTAMO PERSONAL | PRESTAMO PERSONAL | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 1.0 | 0 | V | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 4 | 1223 | 1 | 2021-08-23 13:57:56 | 2022-04-26 07:00:00 | 2022-10-20 18:26:35 | 5 | 0 | 245 | 1.0 | 2022-04-25 | 2022-04-25 | 1.0 | 0.0 | 2022-04-26 02:15:17.742 | 2022-04-26 02:15:17.742 | 91.0 | 91.0 | 2022-04-25 21:58:16.422000+00:00 | 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a | 2022-04-25 | SIN OBSERVACION | PESOS MEXICANOS | 1800.0 | 2500.0 | 0.0 | 2021-07-13 | 2021-05-03 | 2021-05-26 | NaT | 2021-07-13 | 2021-05-03 | 2021-05-26 | MENSUAL | COMPANIA DE PRESTAMO PERSONAL | PRESTAMO PERSONAL | PAGOS FIJOS | INDIVIDUAL (TITULAR) | 1.0 | 0 | V | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
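The inner join drops customers with no bureau records (14,454 → 9,249). Before committing to it, a left join with `indicator=True` can quantify exactly how many customers would be lost. A minimal sketch on toy frames standing in for the two datasets (column values are illustrative):

```python
import pandas as pd

# Toy frames standing in for customer_loan_df / credit_report_df
loans = pd.DataFrame({"customer_id": [1, 2, 3, 4], "loan_id": [10, 11, 12, 13]})
reports = pd.DataFrame({"customer_id": [1, 1, 3], "MAX_CREDIT": [500.0, 800.0, 300.0]})

# indicator=True adds a _merge column flagging which side each row came from
merged = loans.merge(reports, on="customer_id", how="left", indicator=True)
no_bureau = merged.loc[merged["_merge"] == "left_only", "customer_id"].nunique()
print(f"Customers without bureau records: {no_bureau}")
```

On the real data this check would confirm that 5,205 customers have no bureau history, a segment that may deserve its own (thin-file) scoring treatment.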
# --- Examine the basic statistics of the merged dataframe
bankaya_merged.describe().T
| count | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|
| customer_id | 287356.0 | 6333.768075 | 1.0 | 2776.0 | 6187.0 | 9760.0 | 14416.0 | 3989.039684 |
| loan_id | 287356.0 | 7695.095067 | 1.0 | 4697.0 | 7871.0 | 10909.0 | 14454.0 | 4033.276971 |
| ACC_CREATION_DATETIME | 287356 | 2022-05-06 23:16:59.809368576 | 2020-10-14 18:22:10 | 2021-12-25 14:08:44 | 2022-06-16 14:19:46 | 2022-10-05 23:13:30 | 2023-04-24 18:36:17 | NaN |
| APPLICATION_DATETIME | 287356 | 2022-12-09 03:45:22.400088832 | 2022-04-26 07:00:00 | 2022-10-22 07:00:00 | 2022-12-25 08:00:00 | 2023-02-05 08:00:00 | 2023-05-26 07:00:00 | NaN |
| LOAN_ORIGINATION_DATETIME | 287356 | 2023-01-11 20:22:31.556476928 | 2022-07-01 09:03:20 | 2022-12-03 15:19:37 | 2023-01-25 19:02:11 | 2023-03-11 11:12:19 | 2023-05-29 12:18:28 | NaN |
| max_days_late | 287356.0 | 12.781772 | -7.0 | 0.0 | 2.0 | 15.0 | 70.0 | 20.510928 |
| target | 287356.0 | 0.167513 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.373434 |
| account_to_application_days | 287356.0 | 215.583043 | 0.0 | 67.0 | 173.0 | 324.0 | 901.0 | 180.294859 |
| n_sf_apps | 202674.0 | 1.649338 | 1.0 | 1.0 | 1.0 | 2.0 | 36.0 | 1.563695 |
| first_app_date | 202674 | 2022-06-24 06:29:30.638561792 | 2021-04-27 00:00:00 | 2022-03-11 00:00:00 | 2022-07-23 00:00:00 | 2022-10-24 00:00:00 | 2023-04-25 00:00:00 | NaN |
| last_app_date | 202674 | 2022-06-25 02:15:26.990141952 | 2021-04-28 00:00:00 | 2022-03-11 00:00:00 | 2022-07-26 00:00:00 | 2022-10-25 00:00:00 | 2023-04-25 00:00:00 | NaN |
| n_bnpl_apps | 267144.0 | 1.243479 | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.863805 |
| n_bnpl_approved_apps | 267144.0 | 0.275219 | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 | 0.601572 |
| first_bnpl_app_date | 267144 | 2022-08-20 00:47:41.840025088 | 2022-01-06 21:17:08.193000 | 2022-05-05 17:59:36.776999936 | 2022-09-02 23:50:13.225999872 | 2022-11-11 19:36:34.294000128 | 2023-05-20 17:15:47 | NaN |
| last_bnpl_app_date | 267144 | 2022-08-09 16:56:22.504993792 | 2022-01-06 21:17:08.193000 | 2022-04-22 00:44:14.604999936 | 2022-08-11 09:51:12.441999872 | 2022-11-11 03:03:34.395000064 | 2023-05-17 15:20:48 | NaN |
| n_inquiries_l3m | 285656.0 | 14.919267 | 0.0 | 0.0 | 0.0 | 24.0 | 170.0 | 25.800862 |
| n_inquiries_l6m | 285656.0 | 24.228747 | 0.0 | 0.0 | 15.0 | 39.0 | 213.0 | 29.497535 |
| INQUIRY_DATE | 287267 | 2022-07-02 00:44:25.383772416 | 2021-04-29 00:00:00 | 2022-03-22 00:00:00 | 2022-07-18 00:00:00 | 2022-10-27 00:00:00 | 2023-05-17 00:00:00 | NaN |
| MAX_CREDIT | 287174.0 | 15280.383928 | 0.0 | 548.0 | 2300.0 | 7483.0 | 404040416.0 | 1313088.110101 |
| CREDIT_LIMIT | 278999.0 | 5525.841766 | 0.0 | 0.0 | 400.0 | 3000.0 | 1900000.0 | 27882.692051 |
| PAYMENT_AMOUNT | 287267.0 | 1670.948417 | 0.0 | 0.0 | 0.0 | 149.0 | 1800000.0 | 13010.321018 |
| UPDATE_DATE | 287267 | 2020-03-26 04:35:03.002433536 | 2001-04-25 00:00:00 | 2019-03-13 00:00:00 | 2021-11-22 00:00:00 | 2022-06-15 00:00:00 | 2023-05-07 00:00:00 | NaN |
| LOAN_OPENING_DATE | 287267 | 2018-11-22 15:24:26.870890496 | 1950-01-01 00:00:00 | 2017-03-21 00:00:00 | 2020-04-02 00:00:00 | 2021-10-14 00:00:00 | 2023-03-31 00:00:00 | NaN |
| LOAN_CLOSING_DATE | 192609 | 2018-10-24 21:45:09.910751744 | 2000-08-28 00:00:00 | 2017-01-30 00:00:00 | 2020-01-27 00:00:00 | 2021-09-20 00:00:00 | 2023-03-31 00:00:00 | NaN |
| WORST_DELAY_DATE | 84658 | 2019-08-11 02:14:42.969123072 | 1998-10-10 00:00:00 | 2018-04-30 00:00:00 | 2021-05-31 00:00:00 | 2022-04-20 00:00:00 | 2023-04-12 00:00:00 | NaN |
| REPORT_DATE | 287267 | 2020-03-22 05:43:32.962853120 | 2001-04-25 00:00:00 | 2019-02-28 00:00:00 | 2021-11-10 00:00:00 | 2022-06-12 00:00:00 | 2023-05-07 00:00:00 | NaN |
| LAST_PURCHASE_DATE | 283214 | 2019-02-17 08:26:14.566229248 | 1950-01-01 00:00:00 | 2017-07-29 00:00:00 | 2020-08-03 00:00:00 | 2021-11-16 00:00:00 | 2023-04-07 00:00:00 | NaN |
| LAST_PAYMENT_DATE | 255679 | 2019-04-09 00:51:00.575174144 | 1950-01-01 00:00:00 | 2017-09-02 00:00:00 | 2020-09-06 00:00:00 | 2021-12-27 00:00:00 | 2023-05-07 00:00:00 | NaN |
| TOTAL_PAYMENTS | 268711.0 | 22.773876 | 0.0 | 1.0 | 3.0 | 16.0 | 1800.0 | 89.005749 |
| WORST_DELAY | 284146.0 | 4.27879 | 0.0 | 0.0 | 0.0 | 2.0 | 84.0 | 12.231546 |
| CURRENT_BALANCE | 287267.0 | 4578.482927 | 0.0 | 0.0 | 0.0 | 273.0 | 3469743.0 | 35543.251719 |
| BALANCE_DUE | 287267.0 | 2090.320639 | 0.0 | 0.0 | 0.0 | 0.0 | 1800000.0 | 14111.862995 |
| BALANCE_DUE_WORST_DELAY | 287267.0 | 1671.461922 | 0.0 | 0.0 | 0.0 | 159.0 | 1800000.0 | 11646.107604 |
1. Customer Behavior and Risk
Max Days Late:
The average max_days_late is around 12.78 days, but the 75th percentile is 15 days and the maximum reaches 70 days.
There is a segment of customers who are significantly late on their payments; identifying the characteristics of these high-risk customers enables targeted interventions.
Inquiries Last 3 Months (n_inquiries_l3m):
Total Payments (TOTAL_PAYMENTS):
Current Balance (CURRENT_BALANCE):
2. Portfolio Health
Account to Application Days (account_to_application_days):
Payment Amount (PAYMENT_AMOUNT):
from ML_functions.util_functions import inspection
# --- Inspect the Dataframe in order to determine cardinality, missing values, encoding strategy, imputing and transforming variables.
inspection(bankaya_merged)
==== Number of labeles: Cardinality ==== ACCOUNT_TYPE contains 7 labels ACC_CREATION_DATETIME contains 9236 labels APPLICATION_DATETIME contains 368 labels BUSINESS_TYPE contains 59 labels CDC_INQUIRY_ID contains 9173 labels CREDIT_TYPE contains 34 labels CURRENCY contains 4 labels CURRENT_PAYMENT contains 87 labels DELAYED_PAYMENTS contains 98 labels INQUIRY_DATE contains 626 labels INQUIRY_TIME contains 9261 labels LAST_PAYMENT_DATE contains 7269 labels LAST_PURCHASE_DATE contains 7711 labels LOAN_CLOSING_DATE contains 7168 labels LOAN_OPENING_DATE contains 7944 labels LOAN_ORIGINATION_DATETIME contains 9005 labels PAYMENT_FREQUENCY contains 12 labels PREVENTION_KEY contains 32 labels REPORT_DATE contains 5814 labels RESPONSABILITY_TYPE contains 6 labels TOTAL_REPORTED_PAYMENTS contains 2 labels UPDATE_DATE contains 5822 labels WORST_DELAY_DATE contains 5692 labels first_app_date contains 682 labels first_bnpl_app_date contains 8507 labels last_app_date contains 684 labels last_bnpl_app_date contains 8507 labels ==== Types of the variables we are working with ==== customer_id int64 loan_id int64 ACC_CREATION_DATETIME datetime64[ns] APPLICATION_DATETIME datetime64[ns] LOAN_ORIGINATION_DATETIME datetime64[ns] max_days_late int64 target int64 account_to_application_days int64 n_sf_apps float64 first_app_date datetime64[ns] last_app_date datetime64[ns] n_bnpl_apps float64 n_bnpl_approved_apps float64 first_bnpl_app_date datetime64[ns] last_bnpl_app_date datetime64[ns] n_inquiries_l3m float64 n_inquiries_l6m float64 INQUIRY_TIME datetime64[ns, UTC] CDC_INQUIRY_ID object INQUIRY_DATE datetime64[ns] PREVENTION_KEY object CURRENCY object MAX_CREDIT float64 CREDIT_LIMIT float64 PAYMENT_AMOUNT float64 UPDATE_DATE datetime64[ns] LOAN_OPENING_DATE datetime64[ns] LOAN_CLOSING_DATE datetime64[ns] WORST_DELAY_DATE datetime64[ns] REPORT_DATE datetime64[ns] LAST_PURCHASE_DATE datetime64[ns] LAST_PAYMENT_DATE datetime64[ns] PAYMENT_FREQUENCY object BUSINESS_TYPE object CREDIT_TYPE 
object ACCOUNT_TYPE object RESPONSABILITY_TYPE object TOTAL_PAYMENTS float64 DELAYED_PAYMENTS object CURRENT_PAYMENT object WORST_DELAY float64 TOTAL_REPORTED_PAYMENTS object CURRENT_BALANCE float64 BALANCE_DUE float64 BALANCE_DUE_WORST_DELAY float64 dtype: object Total Samples with missing values: 267458 ====Total percentage of missing values per variable ==== WORST_DELAY_DATE 70.538983 LOAN_CLOSING_DATE 32.971993 last_app_date 29.469369 n_sf_apps 29.469369 first_app_date 29.469369 TOTAL_REPORTED_PAYMENTS 14.595484 LAST_PAYMENT_DATE 11.023608 first_bnpl_app_date 7.033784 n_bnpl_approved_apps 7.033784 n_bnpl_apps 7.033784 last_bnpl_app_date 7.033784 TOTAL_PAYMENTS 6.488467 CREDIT_LIMIT 2.908239 LAST_PURCHASE_DATE 1.441418 WORST_DELAY 1.117081 n_inquiries_l3m 0.591601 n_inquiries_l6m 0.591601 MAX_CREDIT 0.063336 PAYMENT_FREQUENCY 0.030972 REPORT_DATE 0.030972 BALANCE_DUE 0.030972 DELAYED_PAYMENTS 0.030972 BUSINESS_TYPE 0.030972 CREDIT_TYPE 0.030972 CURRENT_PAYMENT 0.030972 RESPONSABILITY_TYPE 0.030972 CURRENT_BALANCE 0.030972 ACCOUNT_TYPE 0.030972 BALANCE_DUE_WORST_DELAY 0.030972 LOAN_OPENING_DATE 0.030972 UPDATE_DATE 0.030972 PAYMENT_AMOUNT 0.030972 CURRENCY 0.030972 PREVENTION_KEY 0.030972 INQUIRY_DATE 0.030972 CDC_INQUIRY_ID 0.030972 loan_id 0.000000 INQUIRY_TIME 0.000000 account_to_application_days 0.000000 target 0.000000 max_days_late 0.000000 LOAN_ORIGINATION_DATETIME 0.000000 APPLICATION_DATETIME 0.000000 ACC_CREATION_DATETIME 0.000000 customer_id 0.000000 dtype: float64 ===== Removed 2119 duplicate rows.======
# --- DELAYED_PAYMENTS is stored as object, so coerce it to numeric first
bankaya_merged['DELAYED_PAYMENTS'] = pd.to_numeric(bankaya_merged['DELAYED_PAYMENTS'], errors='coerce')
# --- Replace missing values with the median value of the column
bankaya_merged['DELAYED_PAYMENTS'] = bankaya_merged['DELAYED_PAYMENTS'].fillna(bankaya_merged['DELAYED_PAYMENTS'].median())
# --- Convert the column to int
bankaya_merged['DELAYED_PAYMENTS'] = bankaya_merged['DELAYED_PAYMENTS'].astype('int')
Strategy for Encoding Categorical Variables
Encoding categorical variables is crucial for preparing the data for the model. Below is the strategy for encoding each categorical variable based on the number of unique values:
Data Cleaning
Encoding Variables
Note: Some variables like customer_id and loan_id may not be useful for modeling and can be dropped.
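A common rule of thumb for the strategy described above is to one-hot encode low-cardinality variables and reserve target-style encoders (such as those in the imported `category_encoders`) for high-cardinality ones. A sketch of such a cardinality-based plan on toy columns (the threshold and the rule itself are assumptions, not the notebook's final strategy):

```python
import pandas as pd

# Toy frame with one low-cardinality and one high-cardinality column
df = pd.DataFrame({
    "CURRENCY": ["PESOS MEXICANOS", "PESOS MEXICANOS", "DOLARES", "PESOS MEXICANOS"],
    "BUSINESS_TYPE": ["BANCOS", "TIENDA DEPARTAMENTAL", "SERVICIOS", "BANCOS"],
    "CDC_INQUIRY_ID": [f"id-{i}" for i in range(4)],
})

THRESHOLD = 3  # illustrative cutoff; tune to the real data
plan = {
    col: ("one-hot" if df[col].nunique() <= THRESHOLD else "target-encoding")
    for col in df.columns
}
print(plan)
```

Keeping the decision rule explicit like this makes the encoding choices easy to justify and document.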
I found and removed 2119 duplicate rows from the dataset. Duplicate entries can lead to skewed analysis and incorrect conclusions. Therefore, it's crucial to maintain a clean and unique dataset for accurate business decision-making.
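The deduplication reported above is handled inside the inspection helper; the equivalent step in plain pandas is a single `drop_duplicates` call, sketched here on a toy frame:

```python
import pandas as pd

# Toy frame with one fully duplicated row
df = pd.DataFrame({"customer_id": [1, 1, 2], "loan_id": [10, 10, 11]})
before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(f"Removed {before - len(df)} duplicate rows.")
```

By default `drop_duplicates` keeps the first occurrence and compares all columns; a `subset=` argument would restrict the comparison to key columns.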
# --- Duplicate the DataFrame to do the EDA
bankaya_eda = bankaya_merged.copy()
#bankaya_merged.head(5000).to_csv('/Users/iouriolguin/Documents/test.csv')
#bankaya_merged.to_csv('/Users/iouriolguin/Documents/bankaya_merged.csv')
#bankaya_merged.to_parquet('/Users/iouriolguin/Documents/bankaya_merged.parquet')
# --- Identify all numerical variables in the dataset for the EDA analysis
numerical_vars = bankaya_eda.select_dtypes(include=['number']).columns.tolist()
numerical_vars = [var for var in numerical_vars if var not in ['customer_id', 'loan_id']] #---> remove id variables
# --- Identify all categorical variables in the dataset
categorical_vars = bankaya_eda.select_dtypes(include=['object']).columns.tolist()
categorical_vars.remove("CDC_INQUIRY_ID")
# --- Identify the date columns
date_vars = [col for col in bankaya_eda.columns if 'date' in col.lower() or 'datetime' in col.lower()]
# --- Univariate Analysis of the target variable
# Plotting the distribution of the target variable
fig = px.pie(bankaya_eda["target"], names='target', title='Distribution of Default')
fig.show()
# --- Create a histogram for certain variables
lst_histogram = ["max_days_late","account_to_application_days","MAX_CREDIT","TOTAL_PAYMENTS",'DELAYED_PAYMENTS','n_bnpl_apps','n_inquiries_l3m']
for col in lst_histogram:
plot_histogram(bankaya_eda,col)
max_days_late: This variable mostly takes low values, indicating that most customers are not significantly late on their payments.
account_to_application_days: The distribution shows that most customers apply for a loan shortly after creating an account.
TOTAL_PAYMENTS: This variable mostly takes low values, suggesting that most customers have made few total payments.
MAX_CREDIT: The values are concentrated around a specific range, with a sharp peak evident in the histogram, suggesting that a large number of customers have a maximum credit value within this range.
# Group by the application date and count the number of applications
application_time_series = bankaya_eda.groupby('APPLICATION_DATETIME').size().reset_index(name='Number_of_Applications')
# --- Time Series Plot for the most relevant KPIs
plot_time_series(application_time_series,'Number_of_Applications','APPLICATION_DATETIME','D','mean')
plot_time_series(bankaya_eda,'account_to_application_days','ACC_CREATION_DATETIME','D','sum')
plot_time_series(bankaya_eda,'n_sf_apps','LOAN_ORIGINATION_DATETIME','M','mean')
plot_time_series(bankaya_eda,'n_bnpl_apps','first_bnpl_app_date','M','mean')
plot_time_series(bankaya_eda,'n_inquiries_l3m','INQUIRY_DATE','M','mean')
plot_time_series(bankaya_eda,'TOTAL_PAYMENTS','LOAN_ORIGINATION_DATETIME','D','mean')
To plot the time series I use a relevance-to-objective strategy: the date column should be closely related to what I am trying to analyze. This way, each date column aligns with the objective or question to be answered.
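`plot_time_series` is a project helper, so its exact signature lives in `ML_functions`; under the hood, a daily aggregation presumably boils down to a pandas `resample`. A minimal sketch on toy application dates (the helper's actual behavior may differ):

```python
import pandas as pd

# Toy application log: two applications on the same day, then a gap
ts = pd.DataFrame({
    "APPLICATION_DATETIME": pd.to_datetime(
        ["2022-04-26", "2022-04-26", "2022-04-27", "2022-05-01"]
    ),
    "Number_of_Applications": [1, 1, 1, 1],
})

# Daily resample; sum over an empty day yields 0, so gaps appear explicitly
daily = (
    ts.set_index("APPLICATION_DATETIME")["Number_of_Applications"]
      .resample("D")
      .sum()
)
print(daily)
```

Making the gaps explicit (days with zero applications) matters when reading seasonality or drop-offs from the plots.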
APPLICATION_DATETIME
# --- Read the JSON file created in the project in order to map and group all the business types
# to reduce the cardinality in this variable
# Reading JSON from a file
with open("input/businees_type_catalog.json", 'r') as f:
business_catalog_dict = json.load(f)
# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in business_catalog_dict.items() for k in old_v}
# Update column in the dataframe with the business categories
bankaya_eda['BUSINESS_TYPE'] = bankaya_eda['BUSINESS_TYPE'].replace(reverse_mapping)
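The dict comprehension above flips a `{group: [raw values]}` catalog into a `{raw value: group}` lookup suitable for `replace`. A toy example with the same shape assumed for the JSON file (the real group names live in `businees_type_catalog.json`):

```python
# Hypothetical catalog in the assumed {group: [raw business types]} shape
catalog = {
    "RETAIL": ["TIENDA DEPARTAMENTAL", "TIENDA DE ROPA"],
    "FINANCIAL": ["BANCOS", "COMPANIA DE PRESTAMO PERSONAL"],
}

# Flip it so each raw value maps to its group
reverse_mapping = {raw: group for group, raws in catalog.items() for raw in raws}
print(reverse_mapping["BANCOS"])  # -> FINANCIAL
```

This collapses the 59 raw BUSINESS_TYPE labels into far fewer groups, which is the point of the cardinality-reduction step.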
# --- Pie Charts for Categorical Variables
# ---Remove the date features from the category variables
filtered_categorical_vars = list(set(categorical_vars) - set(date_vars))
filtered_categorical_vars = list(set(filtered_categorical_vars) - set(["CREDIT_TYPE","BUSINESS_TYPE","CURRENT_PAYMENT","PREVENTION_KEY"]))
# --- Plot relevant category variables
for var in filtered_categorical_vars:
plot_pie(bankaya_eda,var)
from ML_functions.plots import cross_values_plot
cross_values_plot(bankaya_eda,"max_days_late","BUSINESS_TYPE","ACCOUNT_TYPE",temp_col_name="days_late")
# --- Investigate the average max_days_late for different 'account_to_application_days' bins
bins = [0, 7, 30, 90, 180, 365, float('inf')]
labels = ['0-7 days', '8-30 days', '31-90 days', '91-180 days', '181-365 days', 'Over a year']
bankaya_eda['Application_Timeframe'] = pd.cut(bankaya_eda['account_to_application_days'], bins=bins, labels=labels)
# --- Calculate the average max_days_late for each bin
avg_max_days_late = bankaya_eda.groupby('Application_Timeframe')['max_days_late'].mean().reset_index()
# -- Plot the results
fig = px.bar(avg_max_days_late, x='Application_Timeframe', y='max_days_late', title='Average max_days_late by Application Timeframe')
fig.show()
Investigating the average max_days_late for people who apply for loans within 'X' days of creating an account, i.e. by the time elapsed between account creation and the loan application.
# Calculate the average max_days_late for each RESPONSABILITY_TYPE
avg_max_days_late_by_responsability = bankaya_eda.groupby('RESPONSABILITY_TYPE')['max_days_late'].mean().reset_index()
# Plot the results
fig = px.bar(avg_max_days_late_by_responsability, x='RESPONSABILITY_TYPE', y='max_days_late', title='Average max_days_late by RESPONSABILITY_TYPE')
fig.show()
We could check if different responsibility types have different average max_days_late.
TITULAR CON AVAL: This category appears to represent primary borrowers with a guarantee or backing (AVAL). Interestingly, this category has the lowest average max_days_late among all categories, suggesting that having both a primary borrower and a guarantee may be a safer lending scenario.
MANCOMUNADO: This likely represents joint responsibility or co-signing of an account or debt. It's noticeable that this category has a similarly high average max_days_late as the "AVAL" category, suggesting co-borrowers might have equivalent risk levels as those who guarantee another's debt.
It might be beneficial to reassess the risk associated with the "AVAL" and "MANCOMUNADO" categories, given their high average max_days_late. This can help in refining lending strategies or terms for such categories.
from ML_functions.plots import plot_box, plot_cross_data
# Plot the distribution of max_days_late for different ACCOUNT_TYPEs
plot_box(bankaya_eda,'ACCOUNT_TYPE','max_days_late','Distribution of max_days_late by ACCOUNT_TYPE')
This could help in designing account features that encourage on-time payments.
This account type has a wider IQR, which implies more variability in payment punctuality. This could indicate a segment that either forgets or struggles to make fixed payments on time.
Given the outliers across all account types, it would be beneficial to engage these high-risk groups with financial counseling, automated reminders, or even potential restructuring of payment terms.
# --- Read the JSON file created in the project to map and group all the credit types
# and reduce the cardinality of this variable
# Reading JSON from a file
with open("input/credit_type.json", 'r') as f:
credit_catalog_dict = json.load(f)
# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in credit_catalog_dict.items() for k in old_v}
# Update column in the dataframe with the credit categories
bankaya_eda['sub_credit_type'] = bankaya_eda['CREDIT_TYPE'].replace(reverse_mapping)
plot_cross_data(bankaya_eda,"max_days_late","sub_credit_type","ACCOUNT_TYPE","days later","mean")
By understanding which combinations of account and sub_credit types have higher days late on average, businesses can prioritize collections efforts on high-risk segments.
# --- Plot matrix correlation
matrix_correlation(bankaya_eda[numerical_vars],cols_to_exclude=None)
max_days_late: This suggests that as the maximum number of days a payment is late increases, the target value also tends to increase, which could make this variable a significant predictor for the target.

n_inquiries_3m and n_inquiries_6m: These two variables show a strong positive correlation of 0.69, which implies that they might be conveying similar information. To avoid multicollinearity, we may keep only one of them in models where this is an issue.

PAYMENT_AMOUNT and BALANCE_DUE: A high positive correlation of 0.79, suggesting that larger payment amounts are associated with higher balances due.
from ML_functions.util_functions import remove_highly_correlated_features
# --- Select the highly correlated variables above a 0.7 threshold
high_correlated_lst = remove_highly_correlated_features(bankaya_eda[numerical_vars],threshold=0.7,cols_to_exclude=["target"])
high_correlated_lst
=== High correlated features to be removed from the Dataset====
['WORST_DELAY', 'BALANCE_DUE', 'BALANCE_DUE_WORST_DELAY']
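The repository's remove_highly_correlated_features helper is not shown here; a minimal sketch of the idea (a greedy upper-triangle scan over the absolute correlation matrix; drop_high_corr and the toy frame are illustrative, not the actual implementation):

```python
import numpy as np
import pandas as pd

def drop_high_corr(df, threshold=0.7, cols_to_exclude=()):
    """Return columns whose absolute pairwise correlation with an
    earlier-kept column exceeds `threshold` (greedy, in column order)."""
    corr = df.drop(columns=list(cols_to_exclude)).corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]

# Toy example: b is a near-copy of a, c is largely independent
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1],
                    "c": [4, 1, 3, 2]})
print(drop_high_corr(toy, threshold=0.7))  # ['b']
```

Scanning the upper triangle means only the second column of each highly correlated pair is flagged, so one representative of the pair survives.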
# --- Select the features to be used by removing highly correlated variables
model_current_features = list(set(bankaya_merged.columns.to_list())- set(high_correlated_lst))
# --- Keep customer_id while dropping other id-like columns
model_current_features = [var for var in model_current_features if 'id' not in var.lower() or "customer_id" in var]
# --- Create a new dataframe so the merged dataset is not modified
bankaya_model_df = bankaya_merged.copy()
bankaya_model_df = bankaya_model_df[model_current_features]
# --- Identify all numerical variables in the dataset for the modeling processing.
model_numerical_vars = bankaya_model_df.select_dtypes(include=['number']).columns.tolist()
# --- Identify all categorical variables in the dataset
model_categorical_vars = bankaya_model_df.select_dtypes(include=['object']).columns.tolist()
# --- Identify the date columns
model_date_vars = [col for col in bankaya_model_df.columns if 'date' in col.lower() or 'datetime' in col.lower()]
# === Diagnostic outliers ===
for var in bankaya_model_df[model_numerical_vars]:
if 'target' not in var and "customer_id" not in var:
plot_diagnostic_outlier(bankaya_model_df[model_numerical_vars],var)
DELAYED_PAYMENTS: The plot on the left appears to be a density plot (Kernel Density Estimation) of the DELAYED_PAYMENTS variable. It shows that the vast majority of customers have very few delayed payments (close to 0), suggesting that a significant proportion of Bankaya's clientele is punctual with their payments.

account_to_application_days: A significant number of customers apply for credit very shortly after creating their accounts, as indicated by the peak near 0 days. As the number of days increases, the density decreases, indicating fewer customers apply for credit long after opening their accounts.

n_inquiries_l6m: There are clear outliers in the box plot, with customers making an unusually high number of inquiries compared to the majority. These outliers range roughly from 50 inquiries up to almost 200.

n_sf_apps: The box plot distinctly portrays several outliers, with individuals submitting an unusually high number of special finance applications compared to the bulk of the dataset. These outliers range from roughly 3 applications upwards, with a few cases extending beyond 30.

max_days_late: The box plot reveals numerous outliers: individuals with a record of being significantly late on their payments, well beyond the typical range observed in the dataset, some with delays exceeding 40 days.

Customers that wait a significantly longer period before applying for credit might have different financial behaviors or needs, so it would be beneficial for the bank to study these outliers separately. Understanding the timing of credit applications relative to account creation can provide valuable insights for optimizing the bank's credit approval process, marketing strategies, and customer service offerings.
It's crucial to understand the nature of these outliers before making any business decisions. They could either be anomalies that need to be corrected or valuable insights that could drive specific business strategies. After understanding the business we can apply the Z-score method to handle outliers
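The z-score method mentioned above can be sketched as follows (the 3-standard-deviation threshold is a common convention assumed here, and zscore_outlier_mask is an illustrative helper, not the project's code):

```python
import pandas as pd

def zscore_outlier_mask(series, threshold=3.0):
    """True where the value lies within `threshold` standard deviations
    of the mean, i.e. is NOT flagged as an outlier."""
    z = (series - series.mean()) / series.std()
    return z.abs() <= threshold

# Toy example: many typical values plus one extreme observation
s = pd.Series(list(range(100)) + [10_000])
mask = zscore_outlier_mask(s)
print(mask.sum())  # number of rows kept after filtering
```

The mask can then be used to drop or cap the flagged rows, depending on whether the business review deems them anomalies or genuine behavior.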
payment_to_due_ratio: How much of the due amount has been paid, built from CURRENT_BALANCE and PAYMENT_AMOUNT. It could be useful for understanding the payment behavior of an individual or entity.

Time to Loan Origination: The duration in days between the application date (APPLICATION_DATETIME) and the loan origination date (LOAN_ORIGINATION_DATETIME).

Days between Last Payment and Last Purchase: Difference between LAST_PAYMENT_DATE and LAST_PURCHASE_DATE. A longer gap might indicate financial distress or irregular payment behavior.

Delayed to Total Payments Ratio: Ratio of DELAYED_PAYMENTS to TOTAL_PAYMENTS. A higher ratio indicates a pattern of delayed payments, signaling a risky customer.

Number of Unique Credit Types by Customer: Count of unique CREDIT_TYPE for each customer_id. Multiple types of credit could indicate either financial savviness or over-leveraging.

Outstanding Balance: CREDIT_LIMIT - CURRENT_BALANCE. A higher outstanding balance relative to the credit limit could indicate financial distress.

Loan Life: Difference between LOAN_CLOSING_DATE and LOAN_OPENING_DATE. A short loan life may mean the customer paid off the loan quickly (financial stability, low risk), or that the loan had to be closed quickly due to default or restructuring (high risk).

Customer Lifetime: Customer tenure helps segment the customer base for targeted marketing. With tenure segmentation and an RFM (Recency, Frequency, Monetary) analysis I can estimate CLV; customers with low CLV and high max_days_late could be considered high-risk.
For the prevention key feature I will use a weighted-feature technique: assign a weight to each category depending on its risk, and then group the dataset by customer id. This preserves the risk information through the aggregation.
For the Business type and Credit type I cleaned and grouped the categories to reduce the dimensionality of the dataset. I will one-hot encode the general categories created for both variables, and use the original categories to count how many unique business types and credit types each user has (number of business types and number of credit types).
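The weighting step can be sketched as a simple map-then-aggregate (the category names and scores below are hypothetical; the real scores come from input/prevention_key_scores.json):

```python
import pandas as pd

# Hypothetical risk scores per prevention-key category
risk_scores = {"SIN OBSERVACION": 1, "CUENTA EN REVISION": 3,
               "FRAUDE": 10}

df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "prevention_key": ["SIN OBSERVACION", "CUENTA EN REVISION", "FRAUDE"],
})

# Map each category to its weight, then average per customer
df["prevention_key_risk_score"] = df["prevention_key"].map(risk_scores)
per_customer = df.groupby("customer_id")["prevention_key_risk_score"].mean()
print(per_customer)
```

Averaging (rather than summing) keeps the score comparable across customers with different numbers of records; the actual pipeline keeps mean, sum, and count during the aggregation.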
# --- Calculate the payment_to_due_ratio: the current balance relative to the payment amount,
# i.e. how many payment amounts would be needed to cover the current balance.
# Guard against division by zero when PAYMENT_AMOUNT is 0.
bankaya_model_df["payment_to_due_ratio"] = np.where(
    bankaya_model_df["PAYMENT_AMOUNT"] == 0,
    0, round(bankaya_model_df["CURRENT_BALANCE"] / bankaya_model_df["PAYMENT_AMOUNT"],2))
# --- Time to loan origination: the duration in days between the application date
# and the loan origination date.
bankaya_model_df["time_to_loan_days"] = -(bankaya_model_df["APPLICATION_DATETIME"] - bankaya_model_df["LOAN_ORIGINATION_DATETIME"]).dt.days
# --- Last payment days. A longer time might indicate financial distress or irregular payment behavior.
bankaya_model_df["last_payment_days"]= abs(bankaya_model_df["LAST_PAYMENT_DATE"] - bankaya_model_df["LAST_PURCHASE_DATE"]).dt.days
# --- Number of unique credit types by customer: multiple types of credit could indicate
# either financial savviness or over-leveraging (pending; not computed at this step).
# --- Outstanding balance. A higher outstanding balance relative to the credit limit could indicate financial distress.
bankaya_model_df["outstanding_balance"] = abs(bankaya_model_df["CREDIT_LIMIT"] - bankaya_model_df["CURRENT_BALANCE"])
# --- Loan life. The customer paid off the loan quickly, possibly indicating financial stability and low risk or
# the loan had to be closed quickly due to default or restructuring, indicating high risk.
bankaya_model_df["loan_life_days"] = (bankaya_model_df["LOAN_CLOSING_DATE"] - bankaya_model_df["LOAN_OPENING_DATE"]).dt.days
# --- Read the json file created in the project in order to map and group all the Business types
# to reduce the cardinality in this variable
# Reading JSON from a file
with open("input/businees_type_catalog.json", 'r') as f:
business_catalog_dict = json.load(f)
# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in business_catalog_dict.items() for k in old_v}
# Update column in the dataframe with the business categories
bankaya_model_df["gen_business_type"] = bankaya_model_df['BUSINESS_TYPE'].replace(reverse_mapping)
# --- Read the JSON file created in the project to map and group all the credit types
# and reduce the cardinality of this variable
# Reading JSON from a file
with open("input/credit_type.json", 'r') as f:
credit_catalog_dict = json.load(f)
# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in credit_catalog_dict.items() for k in old_v}
# Update column in the dataframe with the credit categories
bankaya_model_df['gen_credit_type'] = bankaya_model_df['CREDIT_TYPE'].replace(reverse_mapping)
# --- Reading JSON file from the prevention key scores
with open("input/prevention_key_scores.json","r") as f:
risk_scores= json.load(f)
# --- Current features after the feature engineering processing
cols_current_features =bankaya_model_df.columns.to_list()
# --- Remove unnecessary variables or variables that were already used.
model_cols_to_remove = ["CURRENT_PAYMENT","TOTAL_REPORTED_PAYMENTS",
                        "CREDIT_TYPE","BUSINESS_TYPE",'INQUIRY_TIME'] #+ model_date_vars
# --- Create the final list of variables to use for the model
model_remaining_features = list(set(cols_current_features) - set(model_cols_to_remove))
# --- Create the final dataset, which will be grouped later for the model
bankaya_final_model_df = bankaya_model_df[model_remaining_features]
# ---- Lowercase all the columns in the dataset
bankaya_final_model_df.columns = [col.lower() for col in bankaya_final_model_df.columns]
print(f"dataframe shape: {bankaya_final_model_df.shape}")
bankaya_final_model_df.head(5)
dataframe shape: (285237, 42)
| delayed_payments | max_credit | time_to_loan_days | gen_credit_type | credit_limit | update_date | n_inquiries_l3m | prevention_key | current_balance | n_bnpl_apps | outstanding_balance | inquiry_date | last_app_date | n_sf_apps | loan_origination_datetime | last_payment_days | payment_to_due_ratio | worst_delay_date | account_type | first_app_date | payment_frequency | n_bnpl_approved_apps | payment_amount | account_to_application_days | first_bnpl_app_date | n_inquiries_l6m | acc_creation_datetime | loan_opening_date | total_payments | report_date | responsability_type | max_days_late | target | last_bnpl_app_date | loan_closing_date | currency | last_purchase_date | loan_life_days | customer_id | gen_business_type | application_datetime | last_payment_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3000.0 | 178 | Miscellaneous | 0.0 | 2017-04-19 | 91.0 | SIN OBSERVACION | 0.0 | 1.0 | 0.0 | 2022-04-25 | 2022-04-25 | 1.0 | 2022-10-20 18:26:35 | 1427.0 | 0.0 | NaT | SIN LIMITE PREESTABLECIDO | 2022-04-25 | MENSUAL | 0.0 | 0.0 | 245 | 2022-04-26 02:15:17.742 | 91.0 | 2021-08-23 13:57:56 | 2013-03-27 | NaN | 2017-02-28 | INDIVIDUAL (TITULAR) | 5 | 0 | 2022-04-26 02:15:17.742 | 2017-02-21 | PESOS MEXICANOS | 2017-02-21 | 1427.0 | 1223 | General Services | 2022-04-26 07:00:00 | 2013-03-27 |
| 1 | 0 | 500.0 | 178 | Personal Loans | 2200.0 | 2021-06-21 | 91.0 | SIN OBSERVACION | 0.0 | 1.0 | 2200.0 | 2022-04-25 | 2022-04-25 | 1.0 | 2022-10-20 18:26:35 | 15.0 | 0.0 | NaT | PAGOS FIJOS | 2022-04-25 | MENSUAL | 0.0 | 0.0 | 245 | 2022-04-26 02:15:17.742 | 91.0 | 2021-08-23 13:57:56 | 2021-04-03 | 1.0 | 2021-06-21 | INDIVIDUAL (TITULAR) | 5 | 0 | 2022-04-26 02:15:17.742 | 2021-04-18 | PESOS MEXICANOS | 2021-04-03 | 15.0 | 1223 | Financial Services | 2022-04-26 07:00:00 | 2021-04-18 |
| 2 | 0 | 1100.0 | 178 | Personal Loans | 2200.0 | 2021-06-21 | 91.0 | SIN OBSERVACION | 0.0 | 1.0 | 2200.0 | 2022-04-25 | 2022-04-25 | 1.0 | 2022-10-20 18:26:35 | 4.0 | 0.0 | NaT | PAGOS FIJOS | 2022-04-25 | MENSUAL | 0.0 | 0.0 | 245 | 2022-04-26 02:15:17.742 | 91.0 | 2021-08-23 13:57:56 | 2021-04-18 | 1.0 | 2021-06-21 | INDIVIDUAL (TITULAR) | 5 | 0 | 2022-04-26 02:15:17.742 | 2021-04-22 | PESOS MEXICANOS | 2021-04-18 | 4.0 | 1223 | Financial Services | 2022-04-26 07:00:00 | 2021-04-22 |
| 3 | 0 | 1700.0 | 178 | Personal Loans | 2500.0 | 2021-07-13 | 91.0 | SIN OBSERVACION | 0.0 | 1.0 | 2500.0 | 2022-04-25 | 2022-04-25 | 1.0 | 2022-10-20 18:26:35 | 11.0 | 0.0 | NaT | PAGOS FIJOS | 2022-04-25 | MENSUAL | 0.0 | 0.0 | 245 | 2022-04-26 02:15:17.742 | 91.0 | 2021-08-23 13:57:56 | 2021-04-22 | 1.0 | 2021-07-13 | INDIVIDUAL (TITULAR) | 5 | 0 | 2022-04-26 02:15:17.742 | 2021-05-03 | PESOS MEXICANOS | 2021-04-22 | 11.0 | 1223 | Financial Services | 2022-04-26 07:00:00 | 2021-05-03 |
| 4 | 0 | 1800.0 | 178 | Personal Loans | 2500.0 | 2021-07-13 | 91.0 | SIN OBSERVACION | 0.0 | 1.0 | 2500.0 | 2022-04-25 | 2022-04-25 | 1.0 | 2022-10-20 18:26:35 | 23.0 | 0.0 | NaT | PAGOS FIJOS | 2022-04-25 | MENSUAL | 0.0 | 0.0 | 245 | 2022-04-26 02:15:17.742 | 91.0 | 2021-08-23 13:57:56 | 2021-05-03 | 1.0 | 2021-07-13 | INDIVIDUAL (TITULAR) | 5 | 0 | 2022-04-26 02:15:17.742 | 2021-05-26 | PESOS MEXICANOS | 2021-05-03 | 23.0 | 1223 | Financial Services | 2022-04-26 07:00:00 | 2021-05-26 |
# --- Identify all categorical variables after including the new categories for the Business Type
model_categorical_vars = bankaya_final_model_df.select_dtypes(include=['object']).columns.tolist()
# --- Identify all numerical variables in the dataset for the modeling processing.
model_numerical_vars = bankaya_final_model_df.select_dtypes(include=['number']).columns.tolist()
# --- Select the OHE variables to be transformed
model_ohe_vars = ["account_type","currency","responsability_type"]
# --- Select the TE variables to be transformed
model_te_vars = ["gen_credit_type","gen_business_type","payment_frequency"]
print(f"Total variables so far: {len(model_categorical_vars)+len(model_numerical_vars)}")
Total variables so far: 27
# --- Create dummy variables (OHE) for the categorical variables
bankaya_model_df_ohe = pd.get_dummies(bankaya_final_model_df, columns=model_ohe_vars)
# Show the new columns created as a result of One-Hot Encoding to verify
new_columns = [col for col in bankaya_model_df_ohe.columns if any(col.startswith(prefix) for prefix in model_ohe_vars)]
#new_columns[:10] # Displaying only the first 10 for brevity
# Initialize the target encoder
target_encoder = ce.TargetEncoder(cols=model_te_vars)
# Fit and transform to get the target encoded columns
bankaya_final_model_df_encoded = target_encoder.fit_transform(
bankaya_model_df_ohe[model_te_vars],
bankaya_model_df_ohe['target']
)
# Combine the encoded columns with the rest of your dataframe (excluding original categorical columns)
bankaya_final_model_df = pd.concat(
[
bankaya_model_df_ohe.drop(model_te_vars, axis=1),
bankaya_final_model_df_encoded
],
axis=1
)
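For reference, the idea behind ce.TargetEncoder: each category is replaced by a statistic of the target. A minimal unsmoothed sketch with hypothetical data (the real encoder additionally smooths each category mean toward the global mean):

```python
import pandas as pd

df = pd.DataFrame({
    "gen_credit_type": ["Personal Loans", "Personal Loans",
                        "Mortgage", "Mortgage"],
    "target": [1, 0, 0, 0],
})

# Unsmoothed target encoding: category -> mean of the target
means = df.groupby("gen_credit_type")["target"].mean()
df["gen_credit_type_te"] = df["gen_credit_type"].map(means)
print(df["gen_credit_type_te"].tolist())  # [0.5, 0.5, 0.0, 0.0]
```

In practice the encoder should be fit on training data only (or inside a cross-validated pipeline) to avoid target leakage into the encoded features.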
# --- Create a dictionary for the aggregation to be done.
aggregations = {}
# --- Loop through each column to decide the aggregation method
for col in bankaya_final_model_df.columns:
if col == 'customer_id':
continue # Skip the customer_id column
if bankaya_final_model_df[col].dtype == 'object':
aggregations[col] = ['nunique'] # Number of unique categories for categorical features
elif 'datetime' in str(bankaya_final_model_df[col].dtype):
aggregations[col] = ['min', 'max'] # Min and Max for datetime features
else:
aggregations[col] = ['mean', 'sum', 'count'] # Mean, Sum, and Count for numerical features
# --- Mapping the prevention key with the scores given.
bankaya_final_model_df['prevention_key_risk_score'] = bankaya_final_model_df['prevention_key'].map(risk_scores)
# --- Update the aggregations dictionary to include the new risk score column
aggregations['prevention_key_risk_score'] = ['mean', 'sum', 'count']
# Group by 'customer_id' and aggregate using the defined aggregations
grouped_df = bankaya_final_model_df.groupby('customer_id').agg(aggregations).reset_index()
# Flatten the MultiIndex in columns
grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]
# Rename the 'customer_id_' column to 'customer_id'
grouped_df.rename(columns={'customer_id_': 'customer_id'}, inplace=True)
# Display the first few rows of the grouped dataframe
grouped_df
# --- Create customer lifetime feature
grouped_df['customer_lifetime'] = abs(grouped_df["acc_creation_datetime_min"] - grouped_df["application_datetime_max"]).dt.days
# Display the first few rows of the grouped dataframe
grouped_df
(Output preview of grouped_df, truncated: one row per customer_id, with mean/sum/count aggregates for numerical features, min/max for datetime features, nunique for categorical features, and the new customer_lifetime column.)
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9244 | 14391 | 2.538462 | 66 | 26 | 18744.423077 | 487355.0 | 26 | 7.0 | 182 | 26 | 4722.269231 | 122779.0 | 26 | 2019-10-31 | 2023-04-06 | 52.0 | 1352.0 | 26 | 3 | 541.576923 | 14081.0 | 26 | 1.0 | 26.0 | 26 | 5141.538462 | 133680.0 | 26 | 2023-04-21 | 2023-04-21 | 2023-04-20 | 2023-04-20 | 1.0 | 26.0 | 26 | 2023-04-27 15:08:48 | 2023-04-27 15:08:48 | 106.333333 | 2552.0 | 24 | 0.192308 | 5.00 | 26 | 2020-02-29 | 2023-02-28 | 2023-04-20 | 2023-04-20 | 0.0 | 0.0 | 26 | 541.576923 | 14081.0 | 26 | 0.0 | 0 | 26 | 2023-04-21 01:37:54.043 | 2023-04-21 01:37:54.043 | 52.0 | 1352.0 | 26 | 2023-04-20 20:01:02 | 2023-04-20 20:01:02 | 2019-05-31 | 2022-08-26 | 15.000000 | 345.0 | 23 | 2019-10-31 | 2023-03-31 | 21.0 | 546 | 26 | 0.0 | 0 | 26 | 2023-04-21 01:37:54.043 | 2023-04-21 01:37:54.043 | 2019-10-14 | 2023-03-30 | 2019-06-06 | 2023-03-01 | 212.571429 | 4464.0 | 21 | 2023-04-21 07:00:00 | 2023-04-21 07:00:00 | 2019-10-13 | 2023-03-01 | 0.0 | 0 | 26 | 0.00000 | 0 | 26 | 0.730769 | 19 | 26 | 0.038462 | 1 | 26 | 0.076923 | 2 | 26 | 0.153846 | 4 | 26 | 0.0 | 0 | 26 | 1.0 | 26 | 26 | 0.0 | 0 | 26 | 0.0 | 0 | 26 | 0.807692 | 21 | 26 | 0.038462 | 1 | 26 | 0.153846 | 4 | 26 | 0.000000 | 0 | 26 | 0.170023 | 4.420607 | 26 | 0.170693 | 4.438015 | 26 | 0.168942 | 4.392479 | 26 | 2.461538 | 64.0 | 26 | 0 |
| 9245 | 14404 | 9.115385 | 237 | 26 | 4369.269231 | 113601.0 | 26 | 3.0 | 78 | 26 | 3129.653846 | 81371.0 | 26 | 2005-12-18 | 2023-04-19 | 28.0 | 728.0 | 26 | 5 | 2609.807692 | 67855.0 | 26 | 1.0 | 26.0 | 26 | 2935.692308 | 76328.0 | 26 | 2023-04-23 | 2023-04-23 | NaT | NaT | NaN | 0.0 | 0 | 2023-04-27 17:24:48 | 2023-04-27 17:24:48 | 71.208333 | 1709.0 | 24 | 1.433077 | 37.26 | 26 | 2005-11-01 | 2023-03-31 | NaT | NaT | 0.0 | 0.0 | 26 | 2249.076923 | 58476.0 | 26 | 2.0 | 52 | 26 | 2023-04-23 06:05:02.883 | 2023-04-23 06:05:02.883 | 28.0 | 728.0 | 26 | 2023-04-22 20:37:42 | 2023-04-22 20:37:42 | 2005-05-31 | 2023-02-28 | 16.416667 | 394.0 | 24 | 2005-12-18 | 2023-04-19 | -1.0 | -26 | 26 | 0.0 | 0 | 26 | 2023-04-23 06:05:02.883 | 2023-04-23 06:05:02.883 | 2005-12-18 | 2023-03-06 | 2005-05-31 | 2023-03-11 | 128.937500 | 2063.0 | 16 | 2023-04-25 07:00:00 | 2023-04-25 07:00:00 | 2005-12-18 | 2023-04-10 | 0.0 | 0 | 26 | 0.00000 | 0 | 26 | 0.769231 | 20 | 26 | 0.000000 | 0 | 26 | 0.115385 | 3 | 26 | 0.115385 | 3 | 26 | 0.0 | 0 | 26 | 1.0 | 26 | 26 | 0.0 | 0 | 26 | 0.0 | 0 | 26 | 1.000000 | 26 | 26 | 0.000000 | 0 | 26 | 0.000000 | 0 | 26 | 0.000000 | 0 | 26 | 0.167796 | 4.362688 | 26 | 0.167519 | 4.355506 | 26 | 0.167344 | 4.350938 | 26 | 2.923077 | 76.0 | 26 | 2 |
| 9246 | 14408 | 35.250000 | 141 | 4 | 29629.000000 | 118516.0 | 4 | 4.0 | 16 | 4 | 0.000000 | 0.0 | 4 | 2021-02-28 | 2023-04-09 | 10.0 | 40.0 | 4 | 2 | 7764.250000 | 31057.0 | 4 | 1.0 | 4.0 | 4 | 7764.250000 | 31057.0 | 4 | 2023-04-23 | 2023-04-23 | 2023-04-23 | 2023-04-23 | 1.0 | 4.0 | 4 | 2023-04-27 09:32:32 | 2023-04-27 09:32:32 | 83.000000 | 332.0 | 4 | 0.500000 | 2.00 | 4 | 2020-12-20 | 2021-03-13 | 2023-04-23 | 2023-04-23 | 0.0 | 0.0 | 4 | 7764.250000 | 31057.0 | 4 | 0.0 | 0 | 4 | 2023-04-23 18:06:58.837 | 2023-04-23 18:06:58.837 | 10.0 | 40.0 | 4 | 2023-04-23 12:53:02 | 2023-04-23 12:53:02 | 2019-03-10 | 2021-02-22 | 63.500000 | 254.0 | 4 | 2021-02-28 | 2023-04-09 | 23.0 | 92 | 4 | 0.0 | 0 | 4 | 2023-04-23 18:06:58.837 | 2023-04-23 18:06:58.837 | 2021-02-17 | 2021-06-30 | 2019-03-10 | 2021-02-22 | 113.500000 | 227.0 | 2 | 2023-04-24 07:00:00 | 2023-04-24 07:00:00 | 2019-04-27 | 2021-06-30 | 0.0 | 0 | 4 | 0.00000 | 0 | 4 | 1.000000 | 4 | 4 | 0.000000 | 0 | 4 | 0.000000 | 0 | 4 | 0.000000 | 0 | 4 | 0.0 | 0 | 4 | 1.0 | 4 | 4 | 0.0 | 0 | 4 | 0.0 | 0 | 4 | 1.000000 | 4 | 4 | 0.000000 | 0 | 4 | 0.000000 | 0 | 4 | 0.000000 | 0 | 4 | 0.168101 | 0.672403 | 4 | 0.162282 | 0.649126 | 4 | 0.170136 | 0.680544 | 4 | 5.500000 | 22.0 | 4 | 0 |
| 9247 | 14410 | 2.262821 | 353 | 156 | 2796.339744 | 436229.0 | 156 | 3.0 | 468 | 156 | 123.384615 | 19248.0 | 156 | 2011-07-27 | 2023-04-19 | 33.0 | 5148.0 | 156 | 5 | 183.089744 | 28562.0 | 156 | 1.0 | 156.0 | 156 | 209.846154 | 32736.0 | 156 | 2023-04-23 | 2023-04-23 | 2023-04-23 | 2023-04-23 | 1.0 | 156.0 | 156 | 2023-04-26 21:25:09 | 2023-04-26 21:25:09 | 89.793333 | 13469.0 | 150 | 0.160256 | 25.00 | 156 | 2011-07-20 | 2023-04-12 | 2023-04-23 | 2023-04-23 | 0.0 | 0.0 | 156 | 168.448718 | 26278.0 | 156 | 0.0 | 0 | 156 | 2023-04-23 20:54:13.506 | 2023-04-23 20:54:13.506 | 33.0 | 5148.0 | 156 | 2023-04-23 15:49:12 | 2023-04-23 15:49:12 | 2011-03-25 | 2023-02-19 | 4.993506 | 769.0 | 154 | 2011-07-27 | 2023-04-19 | 0.0 | 0 | 156 | 0.0 | 0 | 156 | 2023-04-23 20:54:13.506 | 2023-04-23 20:54:13.506 | 2011-07-22 | 2023-02-22 | 2011-03-25 | 2023-03-24 | 85.902098 | 12284.0 | 143 | 2023-04-24 07:00:00 | 2023-04-24 07:00:00 | 2011-07-21 | 2023-03-22 | 0.0 | 0 | 156 | 0.00000 | 0 | 156 | 0.237179 | 37 | 156 | 0.730769 | 114 | 156 | 0.025641 | 4 | 156 | 0.006410 | 1 | 156 | 0.0 | 0 | 156 | 1.0 | 156 | 156 | 0.0 | 0 | 156 | 0.0 | 0 | 156 | 0.974359 | 152 | 156 | 0.000000 | 0 | 156 | 0.012821 | 2 | 156 | 0.012821 | 2 | 156 | 0.166609 | 25.991020 | 156 | 0.158096 | 24.663003 | 156 | 0.169450 | 26.434217 | 156 | 1.256410 | 196.0 | 156 | 0 |
| 9248 | 14416 | 1.821429 | 51 | 28 | 20347.642857 | 569734.0 | 28 | 8.0 | 224 | 28 | 21060.392857 | 589691.0 | 28 | 2009-04-03 | 2023-04-02 | 11.0 | 308.0 | 28 | 3 | 5381.107143 | 150671.0 | 28 | 1.0 | 28.0 | 28 | 15679.285714 | 439020.0 | 28 | 2023-04-24 | 2023-04-24 | 2023-04-24 | 2023-04-24 | 1.0 | 28.0 | 28 | 2023-05-05 21:09:11 | 2023-05-05 21:09:11 | 223.400000 | 5585.0 | 25 | 0.107143 | 3.00 | 28 | 2008-07-30 | 2023-03-31 | 2023-04-24 | 2023-04-24 | 0.0 | 0.0 | 28 | 5381.107143 | 150671.0 | 28 | 3.0 | 84 | 28 | 2023-04-24 23:36:58.288 | 2023-04-24 23:36:58.288 | 11.0 | 308.0 | 28 | 2023-04-24 18:36:17 | 2023-04-24 18:36:17 | 2005-11-04 | 2023-03-17 | 11.925926 | 322.0 | 27 | 2009-04-03 | 2023-04-02 | 0.0 | 0 | 28 | 0.0 | 0 | 28 | 2023-04-24 23:36:58.288 | 2023-04-24 23:36:58.288 | 2006-10-31 | 2023-02-22 | 2005-11-04 | 2021-12-30 | 587.960000 | 14699.0 | 25 | 2023-04-28 07:00:00 | 2023-04-28 07:00:00 | 2006-10-31 | 2023-01-12 | 0.0 | 0 | 28 | 0.00000 | 0 | 28 | 0.678571 | 19 | 28 | 0.107143 | 3 | 28 | 0.142857 | 4 | 28 | 0.071429 | 2 | 28 | 0.0 | 0 | 28 | 1.0 | 28 | 28 | 0.0 | 0 | 28 | 0.0 | 0 | 28 | 1.000000 | 28 | 28 | 0.000000 | 0 | 28 | 0.000000 | 0 | 28 | 0.000000 | 0 | 28 | 0.157965 | 4.423030 | 28 | 0.157886 | 4.420809 | 28 | 0.165593 | 4.636598 | 28 | 1.357143 | 38.0 | 28 | 3 |
9249 rows × 144 columns
# --- Choose only the variables I use for the model
model_features_final = [
'customer_id','target_mean','customer_lifetime','n_sf_apps_sum','prevention_key_nunique',
'account_to_application_days_count', 'loan_life_days_mean', 'max_credit_sum',
'total_payments_sum', 'n_inquiries_l3m_mean', 'payment_amount_sum',
'payment_to_due_ratio_mean', 'time_to_loan_days_mean',
'credit_limit_sum', 'n_bnpl_approved_apps_mean', 'outstanding_balance_mean',
'n_inquiries_l6m_mean', 'n_bnpl_apps_sum', 'max_days_late_sum', 'last_payment_days_sum',
'delayed_payments_sum', 'current_balance_sum', 'account_type_CREDITO DE HABILITACION O AVIO_sum',
'account_type_HIPOTECA_sum', 'account_type_PAGOS FIJOS_sum', 'account_type_POR DETERMINAR_sum',
'account_type_REVOLVENTE_sum', 'account_type_SIN LIMITE PREESTABLECIDO_sum', 'currency_DOLARES_sum',
'currency_PESOS MEXICANOS_sum', 'currency_UNIDADES DE INVERSION_sum', 'responsability_type_AVAL_sum',
'responsability_type_INDIVIDUAL (TITULAR)_sum', 'responsability_type_MANCOMUNADO_sum',
'responsability_type_OBLIGATORIO SOLIDARIO_sum', 'responsability_type_TITULAR CON AVAL_sum',
'prevention_key_risk_score_mean','gen_credit_type_mean','gen_business_type_mean','payment_frequency_mean'
]
# --- Create the final dataframe for the model
bankaya_grouped_df = grouped_df[model_features_final]
# ---- Lowercase all the columns in the dataset
bankaya_grouped_df.columns = [col.lower() for col in bankaya_grouped_df.columns]
# ---- Rename the columns of the dataset for better readability
new_cols_named = {
'target_mean':'target','n_sf_apps_sum':'n_sf_apps',
'account_to_application_days_count': 'total_credits', 'max_credit_sum': 'max_credit',
'payment_amount_sum':'payment_amount','payment_to_due_ratio_mean':'payment_to_due_ratio','credit_limit_sum':'credit_limit','outstanding_balance_mean':'outstanding_balance',
'n_bnpl_apps_sum':'n_bnpl_apps','max_days_late_sum':'max_days_late','last_payment_days_sum':'last_payment_days','delayed_payments_sum':'delayed_payments',
'current_balance_sum':'current_balance','account_type_CREDITO DE HABILITACION O AVIO_sum':'account_type_CREDITO_DE_HABILITACION_O_AVIO',
'account_type_HIPOTECA_sum':'account_type_HIPOTECA','account_type_PAGOS FIJOS_sum':'account_type_PAGOS_FIJOS',
'account_type_POR DETERMINAR_sum':'account_type_POR_DETERMINAR','account_type_REVOLVENTE_sum':'account_type_REVOLVENTE',
'account_type_SIN LIMITE PREESTABLECIDO_sum': 'account_type_SIN_LIMITE_PREESTABLECIDO','currency_DOLARES_sum':'currency_DOLARES',
'currency_PESOS MEXICANOS_sum':'currency_PESOS_MEXICANOS','currency_UNIDADES DE INVERSION_sum':'currency_UNIDADES_DE_INVERSION',
'responsability_type_AVAL_sum':'responsability_type_AVAL','responsability_type_INDIVIDUAL (TITULAR)_sum':'responsability_type_INDIVIDUAL(TITULAR)',
'responsability_type_MANCOMUNADO_sum':'responsability_type_MANCOMUNADO', 'responsability_type_OBLIGATORIO SOLIDARIO_sum': 'responsability_type_OBLIGATORIO_SOLIDARIO',
'responsability_type_TITULAR CON AVAL_sum':'responsability_type_TITULAR_CON_AVAL','prevention_key_risk_score_mean':'prevention_key_risk_score',
'gen_credit_type_mean':'gen_credit_type','gen_business_type_mean':'gen_business_type','payment_frequency_mean':'payment_frequency'
}
# Use the rename method with the dictionary to rename the columns
bankaya_grouped_df = bankaya_grouped_df.rename(columns=new_cols_named)
# --- Display the final dataset to be modeled
bankaya_grouped_df
| customer_id | target | customer_lifetime | n_sf_apps | prevention_key_nunique | total_credits | loan_life_days_mean | max_credit | total_payments_sum | n_inquiries_l3m_mean | payment_amount | payment_to_due_ratio | time_to_loan_days_mean | credit_limit | n_bnpl_approved_apps_mean | outstanding_balance | n_inquiries_l6m_mean | n_bnpl_apps | max_days_late | last_payment_days | delayed_payments | current_balance | account_type_credito de habilitacion o avio_sum | account_type_hipoteca_sum | account_type_pagos fijos_sum | account_type_por determinar_sum | account_type_revolvente_sum | account_type_sin limite preestablecido_sum | currency_dolares_sum | currency_pesos mexicanos_sum | currency_unidades de inversion_sum | responsability_type_aval_sum | responsability_type_individual (titular)_sum | responsability_type_mancomunado_sum | responsability_type_obligatorio solidario_sum | responsability_type_titular con aval_sum | prevention_key_risk_score | gen_credit_type | gen_business_type | payment_frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 851 | 8.0 | 2 | 8 | 1341.500000 | 170056.0 | 1.0 | NaN | 1621.0 | 4.272500 | 4.0 | 552640.0 | 2.0 | 66834.000000 | NaN | 16.0 | 0 | 610.0 | 1 | 17968.0 | 0 | 0 | 0 | 0 | 6 | 2 | 0 | 8 | 0 | 0 | 8 | 0 | 0 | 0 | 1.250000 | 0.178660 | 0.173792 | 0.162429 |
| 1 | 2 | 0.0 | 828 | 80.0 | 4 | 10 | 1256.333333 | 152275.0 | 37.0 | 0.0 | 1689.0 | NaN | 2.0 | 338100.0 | 1.0 | 34702.333333 | 0.0 | 10.0 | -10 | 2425.0 | 0 | 25779.0 | 0 | 0 | 1 | 0 | 9 | 0 | 0 | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 2.000000 | 0.183405 | 0.165376 | 0.162429 |
| 2 | 3 | 1.0 | 818 | 0.0 | 4 | 32 | 67.090909 | 471787.0 | 410.0 | 0.0 | 12143.0 | 4.142500 | 11.0 | 536040.0 | 0.0 | 3392.906250 | 31.0 | 32.0 | 2016 | 2373.0 | 85 | 472759.0 | 0 | 1 | 29 | 2 | 0 | 0 | 0 | 32 | 0 | 0 | 32 | 0 | 0 | 0 | 1.500000 | 0.165853 | 0.167984 | 0.163515 |
| 3 | 4 | 0.0 | 687 | 78.0 | 2 | 13 | 26.333333 | 50855.0 | 59.0 | 0.0 | 5477.0 | 4.870769 | 23.0 | 37155.0 | 0.0 | 2155.846154 | 0.0 | 91.0 | 364 | 541.0 | 2 | 44823.0 | 0 | 0 | 10 | 0 | 3 | 0 | 0 | 13 | 0 | 0 | 13 | 0 | 0 | 0 | 1.076923 | 0.169715 | 0.169220 | 0.164236 |
| 4 | 5 | 0.0 | 694 | 49.0 | 9 | 49 | 528.000000 | 1348273.0 | 1799.0 | 0.0 | 450058.0 | 2.582857 | 28.0 | 846836.0 | NaN | 10717.413043 | 0.0 | 0.0 | 294 | 12021.0 | 304 | 589255.0 | 0 | 0 | 20 | 0 | 18 | 11 | 0 | 49 | 0 | 0 | 49 | 0 | 0 | 0 | 3.142857 | 0.172096 | 0.165770 | 0.166272 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9244 | 14391 | 0.0 | 0 | 26.0 | 3 | 26 | 212.571429 | 487355.0 | 345.0 | 52.0 | 14081.0 | 0.192308 | 7.0 | 122779.0 | 0.0 | 5141.538462 | 52.0 | 26.0 | 546 | 2552.0 | 66 | 14081.0 | 0 | 0 | 19 | 1 | 2 | 4 | 0 | 26 | 0 | 0 | 21 | 1 | 4 | 0 | 2.461538 | 0.170023 | 0.170693 | 0.168942 |
| 9245 | 14404 | 0.0 | 2 | 0.0 | 5 | 26 | 128.937500 | 113601.0 | 394.0 | 28.0 | 58476.0 | 1.433077 | 3.0 | 81371.0 | 0.0 | 2935.692308 | 28.0 | 26.0 | -26 | 1709.0 | 237 | 67855.0 | 0 | 0 | 20 | 0 | 3 | 3 | 0 | 26 | 0 | 0 | 26 | 0 | 0 | 0 | 2.923077 | 0.167796 | 0.167519 | 0.167344 |
| 9246 | 14408 | 0.0 | 0 | 4.0 | 2 | 4 | 113.500000 | 118516.0 | 254.0 | 10.0 | 31057.0 | 0.500000 | 4.0 | 0.0 | 0.0 | 7764.250000 | 10.0 | 4.0 | 92 | 332.0 | 141 | 31057.0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 4 | 0 | 0 | 0 | 5.500000 | 0.168101 | 0.162282 | 0.170136 |
| 9247 | 14410 | 0.0 | 0 | 156.0 | 5 | 156 | 85.902098 | 436229.0 | 769.0 | 33.0 | 26278.0 | 0.160256 | 3.0 | 19248.0 | 0.0 | 209.846154 | 33.0 | 156.0 | 0 | 13469.0 | 353 | 28562.0 | 0 | 0 | 37 | 114 | 4 | 1 | 0 | 156 | 0 | 0 | 152 | 0 | 2 | 2 | 1.256410 | 0.166609 | 0.158096 | 0.169450 |
| 9248 | 14416 | 0.0 | 3 | 28.0 | 3 | 28 | 587.960000 | 569734.0 | 322.0 | 11.0 | 150671.0 | 0.107143 | 8.0 | 589691.0 | 0.0 | 15679.285714 | 11.0 | 28.0 | 0 | 5585.0 | 51 | 150671.0 | 0 | 0 | 19 | 3 | 4 | 2 | 0 | 28 | 0 | 0 | 28 | 0 | 0 | 0 | 1.357143 | 0.157965 | 0.157886 | 0.165593 |
9249 rows × 40 columns
from sklearn.cluster import KMeans
import numpy as np
# Prepare the data for clustering
# Remove NaN values and reshape the data to meet the requirements of the KMeans algorithm
customer_tenure_data = bankaya_grouped_df['customer_lifetime'].dropna().values.reshape(-1, 1)
# Using the elbow method to find the optimal number of clusters
inertia = []
for i in range(1, 11):
kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
kmeans.fit(customer_tenure_data)
inertia.append(kmeans.inertia_)
# Plotting the elbow graph
plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method For Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
# ---Perform K-means clustering with 3 clusters (as one of the potential elbow points)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
bankaya_grouped_df['Lifetime_Cluster'] = kmeans.fit_predict(customer_tenure_data)
#---Calculate the mean and count for each cluster
cluster_summary = bankaya_grouped_df.groupby('Lifetime_Cluster')['customer_lifetime'].agg(['mean', 'count']).reset_index()
# ---Sort clusters by the mean tenure
cluster_summary = cluster_summary.sort_values(by='mean')
# ---Rename clusters based on the sorted mean tenure (lowest mean -> Short_Tenure).
# Note: iterrows() keeps the original index labels after sort_values, so enumerate()
# is used to get the rank position rather than the index label.
tenure_labels = ['Short_Tenure', 'Medium_Tenure', 'Long_Tenure']
cluster_map = {row['Lifetime_Cluster']: tenure_labels[pos]
               for pos, (_, row) in enumerate(cluster_summary.iterrows())}
bankaya_grouped_df['Lifetime_Cluster'] = bankaya_grouped_df['Lifetime_Cluster'].map(cluster_map)
# ---Show the cluster summary and some sample rows
print(cluster_summary)
bankaya_grouped_df.sample(5)
   Lifetime_Cluster        mean  count
1                 1   70.579334   4355
0                 0  261.493224   3173
2                 2  516.891342   1721
| customer_id | target | customer_lifetime | n_sf_apps | prevention_key_nunique | total_credits | loan_life_days_mean | max_credit | total_payments_sum | n_inquiries_l3m_mean | payment_amount | payment_to_due_ratio | time_to_loan_days_mean | credit_limit | n_bnpl_approved_apps_mean | outstanding_balance | n_inquiries_l6m_mean | n_bnpl_apps | max_days_late | last_payment_days | delayed_payments | current_balance | account_type_credito de habilitacion o avio_sum | account_type_hipoteca_sum | account_type_pagos fijos_sum | account_type_por determinar_sum | account_type_revolvente_sum | account_type_sin limite preestablecido_sum | currency_dolares_sum | currency_pesos mexicanos_sum | currency_unidades de inversion_sum | responsability_type_aval_sum | responsability_type_individual (titular)_sum | responsability_type_mancomunado_sum | responsability_type_obligatorio solidario_sum | responsability_type_titular con aval_sum | prevention_key_risk_score | gen_credit_type | gen_business_type | payment_frequency | Lifetime_Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3535 | 4290 | 1.0 | 190 | 28.0 | 3 | 28 | 44.923077 | 47222.0 | 102.0 | 0.0 | 15731.0 | 2.977500 | 2.0 | 17106.0 | 0.0 | 601.571429 | 0.0 | 56.0 | 1736 | 964.0 | 177 | 30148.0 | 0 | 0 | 21 | 0 | 3 | 4 | 0 | 28 | 0 | 0 | 28 | 0 | 0 | 0 | 1.321429 | 0.168919 | 0.166691 | 0.163791 | Medium_Tenure |
| 8470 | 11758 | 0.0 | 8 | 13.0 | 4 | 13 | 217.750000 | 71356.0 | 149.0 | 16.0 | 3377.0 | 10.236154 | 13.0 | 24469.0 | 0.0 | 3162.000000 | 16.0 | 13.0 | 13 | 1008.0 | 49 | 19089.0 | 0 | 0 | 7 | 0 | 0 | 6 | 0 | 13 | 0 | 0 | 13 | 0 | 0 | 0 | 3.076923 | 0.168947 | 0.175107 | 0.166437 | Long_Tenure |
| 3862 | 4692 | 0.0 | 251 | 4.0 | 2 | 4 | NaN | 64351.0 | 906.0 | 0.0 | 86493.0 | 9.342500 | 8.0 | 51500.0 | 1.0 | 11773.000000 | 0.0 | 8.0 | 24 | 74.0 | 25 | 98572.0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 4 | 0 | 0 | 4 | 0 | 0 | 0 | 3.000000 | 0.173848 | 0.166270 | 0.166254 | Medium_Tenure |
| 9019 | 13611 | 0.0 | 0 | 0.0 | 7 | 60 | 143.975610 | 225496.0 | 313.0 | 70.0 | 244416.0 | 0.696000 | 5.0 | 166976.0 | NaN | 4564.859649 | 70.0 | 0.0 | 0 | 3992.0 | 218 | 253137.0 | 0 | 0 | 49 | 0 | 9 | 2 | 0 | 60 | 0 | 0 | 60 | 0 | 0 | 0 | 2.016667 | 0.168732 | 0.168494 | 0.166905 | Long_Tenure |
| 5797 | 7404 | 0.0 | 204 | 0.0 | 4 | 21 | 145.750000 | 192926.0 | 1162.0 | 0.0 | 21165.0 | 0.341905 | 3.0 | 147853.0 | 0.0 | 5801.047619 | 0.0 | 21.0 | 0 | 2362.0 | 52 | 26031.0 | 0 | 0 | 17 | 0 | 4 | 0 | 0 | 21 | 0 | 0 | 21 | 0 | 0 | 0 | 2.047619 | 0.167250 | 0.167660 | 0.167200 | Medium_Tenure |
Cluster_1 (Short Tenure): With an average tenure of about 71 days, these are relatively new customers. They might need more educational content and initial offers to fully engage with the services.
Cluster_0 (Medium Tenure): With an average tenure of about 261 days, these customers have been engaged for a reasonable time. They could be targeted for up-sell or cross-sell opportunities.
Cluster_2 (Long Tenure): With an average tenure of about 517 days, these are the most loyal customers, who might be more open to premium services or loyalty programs.
The average customer tenure is around 219 days, but there's a wide range, up to 901 days. This indicates that while some customers are relatively new, others have been engaged for a longer time. Understanding customer tenure can help in segmenting the customer base for targeted marketing. For example, longer-tenured customers might be more receptive to cross-selling or up-selling initiatives, while shorter-tenured customers might benefit from welcome offers or educational content.
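The relabeling step hinges on ranking the raw KMeans cluster ids by their mean tenure before assigning the Short/Medium/Long names. A minimal sketch of that rank-by-mean relabeling on hypothetical tenure values (the cluster ids and numbers below are made up for illustration):

```python
import pandas as pd

# Hypothetical tenure values (days) with raw KMeans cluster ids
df = pd.DataFrame({
    'cluster': [1, 1, 0, 0, 2, 2],
    'tenure':  [60, 80, 250, 270, 500, 530],
})

# Order cluster ids by mean tenure, then relabel lowest -> Short, highest -> Long
means = df.groupby('cluster')['tenure'].mean().sort_values()
labels = ['Short_Tenure', 'Medium_Tenure', 'Long_Tenure']
cluster_map = {cid: labels[pos] for pos, cid in enumerate(means.index)}
df['segment'] = df['cluster'].map(cluster_map)
```

Because the mapping is derived from the sorted means rather than from the raw cluster ids, the labels stay correct even if KMeans assigns ids in a different order on another run.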
First, I will identify the key columns I need for RFM analysis:
# Calculate Recency, Frequency, and Monetary value for each customer
# Recency
recency_df = bankaya_final_model_df.groupby('customer_id')['last_payment_date'].max().reset_index()
recency_df.columns = ['customer_id', 'Recent_Payment_Date']
recency_df['Recency'] = (bankaya_final_model_df['last_payment_date'].max() - recency_df['Recent_Payment_Date']).dt.days
# Frequency
frequency_df = bankaya_final_model_df.groupby('customer_id')['application_datetime'].count().reset_index()
frequency_df.columns = ['customer_id', 'Frequency']
# Monetary
monetary_df = bankaya_final_model_df.groupby('customer_id')['current_balance'].mean().reset_index()
monetary_df.columns = ['customer_id', 'Monetary']
# Merge these dataframes to create the RFM table
rfm_df = pd.merge(recency_df, frequency_df, on='customer_id')
rfm_df = pd.merge(rfm_df, monetary_df, on='customer_id')
# Show some statistics and sample rows for the RFM features
rfm_stats = rfm_df[['Recency', 'Frequency', 'Monetary']].describe()
rfm_sample = rfm_df.sample(5)
rfm_stats
| Recency | Frequency | Monetary | |
|---|---|---|---|
| count | 9145.000000 | 9249.000000 | 9172.000000 |
| mean | 463.319956 | 30.839766 | 6852.744311 |
| std | 424.451785 | 30.582474 | 14799.353291 |
| min | 0.000000 | 1.000000 | 0.000000 |
| 25% | 249.000000 | 10.000000 | 1223.190000 |
| 50% | 374.000000 | 21.000000 | 2614.046296 |
| 75% | 542.000000 | 41.000000 | 5960.183279 |
| max | 6474.000000 | 269.000000 | 481599.500000 |
rfm_sample
| customer_id | Recent_Payment_Date | Recency | Frequency | Monetary | |
|---|---|---|---|---|---|
| 7671 | 10605 | 2022-10-17 | 202.0 | 28 | 1878.642857 |
| 73 | 90 | 2022-06-10 | 331.0 | 91 | 432.857143 |
| 8602 | 12060 | 2022-12-19 | 139.0 | 19 | 2620.000000 |
| 8324 | 11564 | 2022-11-18 | 170.0 | 9 | 226.555556 |
| 8871 | 13084 | 2023-01-09 | 118.0 | 139 | 3130.467626 |
Recency: A high average recency of 463 days suggests that many customers have not interacted with the services for a significant amount of time. Retention campaigns might be needed.
Frequency: The average frequency is 31, indicating that customers tend to apply for loans or other services multiple times. Cross-selling or up-selling strategies could be effective.
Monetary: The high variation in the monetary value, ranging from $0 to $481,600, suggests a diverse customer base in terms of value. Segmenting customers based on this could help in personalized marketing.
High-Value Customers: Customers with low recency, high frequency, and high monetary values are generally the most valuable. Focused retention strategies can be designed for this segment.
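Each of the three RFM components reduces to a groupby-aggregate per customer: Recency is days since the last payment relative to the dataset-wide latest date, Frequency is a count of applications, and Monetary is the mean balance. A self-contained sketch on hypothetical payment records (all values below are made up):

```python
import pandas as pd

# Hypothetical payment records for three customers
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'last_payment_date': pd.to_datetime(
        ['2023-01-01', '2023-03-01', '2023-04-01', '2023-04-10', '2023-04-15', '2022-12-01']),
    'current_balance': [100.0, 300.0, 50.0, 70.0, 90.0, 1000.0],
})

# Recency is measured against the most recent payment date in the whole dataset
ref_date = tx['last_payment_date'].max()
rfm = tx.groupby('customer_id').agg(
    Recency=('last_payment_date', lambda s: (ref_date - s.max()).days),
    Frequency=('last_payment_date', 'count'),
    Monetary=('current_balance', 'mean'),
).reset_index()
```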
# --- Further segmenting the RFM scores using quantiles
# For Recency, the lower the value, the better the segment; hence the labels are in reverse order
rfm_df['R_Segment'] = pd.qcut(rfm_df['Recency'], 4, labels=['4', '3', '2', '1'])
# --- For Frequency and Monetary, the higher the value, the better the segment
rfm_df['F_Segment'] = pd.qcut(rfm_df['Frequency'], 4, labels=['1', '2', '3', '4'])
rfm_df['M_Segment'] = pd.qcut(rfm_df['Monetary'], 4, labels=['1', '2', '3', '4'])
# --- Combine the individual segments to create a combined RFM score
rfm_df['RFM_Score'] = rfm_df['R_Segment'].astype(str) + rfm_df['F_Segment'].astype(str) + rfm_df['M_Segment'].astype(str)
# --- Define the RFM level function
def rfm_level(df):
if df['RFM_Score'] == '444':
return 'Champion'
elif df['F_Segment'] == '4' and df['M_Segment'] == '4':
return 'Loyal Customer'
elif df['R_Segment'] == '1':
return 'At Risk'
elif df['R_Segment'] == '2' and df['F_Segment'] == '1':
return 'Needs Attention'
else:
return 'Average'
# --- Calculate RFM Level
rfm_df['RFM_Level'] = rfm_df.apply(rfm_level, axis=1)
# --- Show the distribution of RFM Level and some sample rows
rfm_level_summary = rfm_df['RFM_Level'].value_counts().reset_index()
rfm_level_summary.columns = ['RFM_Level', 'Count']
rfm_sample_segmented = rfm_df
rfm_level_summary
| RFM_Level | Count | |
|---|---|---|
| 0 | Average | 6168 |
| 1 | At Risk | 2222 |
| 2 | Needs Attention | 510 |
| 3 | Loyal Customer | 222 |
| 4 | Champion | 127 |
# --- Display the RFM scores; I will include these scores in the model
rfm_sample_segmented
| customer_id | Recent_Payment_Date | Recency | Frequency | Monetary | R_Segment | F_Segment | M_Segment | RFM_Score | RFM_Level | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-05-27 | 345.0 | 8 | 2246.000000 | 3 | 1 | 2 | 312 | Average |
| 1 | 2 | 2022-04-16 | 386.0 | 10 | 2577.900000 | 2 | 1 | 2 | 212 | Needs Attention |
| 2 | 3 | 2022-09-07 | 242.0 | 32 | 14773.718750 | 4 | 3 | 4 | 434 | Average |
| 3 | 4 | 2022-01-03 | 489.0 | 13 | 3447.923077 | 2 | 2 | 3 | 223 | Average |
| 4 | 5 | 2021-09-24 | 590.0 | 49 | 12025.612245 | 1 | 4 | 4 | 144 | Loyal Customer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9244 | 14391 | 2023-03-01 | 67.0 | 26 | 541.576923 | 4 | 3 | 1 | 431 | Average |
| 9245 | 14404 | 2023-04-10 | 27.0 | 26 | 2609.807692 | 4 | 3 | 2 | 432 | Average |
| 9246 | 14408 | 2021-06-30 | 676.0 | 4 | 7764.250000 | 1 | 1 | 4 | 114 | At Risk |
| 9247 | 14410 | 2023-03-22 | 46.0 | 156 | 183.089744 | 4 | 4 | 1 | 441 | Average |
| 9248 | 14416 | 2023-01-12 | 115.0 | 28 | 5381.107143 | 4 | 3 | 3 | 433 | Average |
9249 rows × 10 columns
Champion: These are your best customers who are recent, frequent, and high spenders. Focus on retaining them and offering them personalized services.
Loyal Customer: These customers are frequent and high spenders but might not be as recent. Keep them engaged through loyalty programs or personalized offers.
At Risk: These are customers who have not interacted recently but have a history of frequent interactions and spending. Retargeting campaigns can bring them back.
Needs Attention: These customers have not interacted recently and have low frequency and monetary value. They require nurturing to move up the value chain.
Average: These customers fall in between all these categories. Regular marketing campaigns could be effective for this group.
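The quartile scoring that feeds these levels relies on `pd.qcut` with the labels reversed for Recency, so the most recent quartile receives the highest score. A small sketch with made-up recency values:

```python
import pandas as pd

# Toy recency values (days); lower recency should earn the higher segment label
recency = pd.Series([10, 40, 80, 200, 365, 500, 700, 900])

# qcut makes four equal-sized bins; reversed labels score the most recent
# quartile '4' and the least recent quartile '1'
r_seg = pd.qcut(recency, 4, labels=['4', '3', '2', '1'])
```

Frequency and Monetary use the same call with labels in ascending order, since higher values are better for those two components.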
## --- Enhancing the DataFrame with RFM Scores for Modeling:
bankaya_model_merged_df = bankaya_grouped_df.merge(rfm_df[['customer_id', 'RFM_Score']], on='customer_id', how='left')
# Scores built from a missing R/F/M segment contain the literal text 'nan'
# (a by-product of astype(str)); flag any such score as missing
bankaya_model_merged_df['RFM_Score'] = bankaya_model_merged_df['RFM_Score'].replace('nan', np.nan, regex=True)
# Now I can fill NaN values with a placeholder, or drop them, etc.
bankaya_model_merged_df['RFM_Score'] = bankaya_model_merged_df['RFM_Score'].fillna(0)
# Attempt conversion to integer
bankaya_model_merged_df['RFM_Score'] = bankaya_model_merged_df['RFM_Score'].astype(int)
bankaya_model_merged_df.sample(5)
| customer_id | target | customer_lifetime | n_sf_apps | prevention_key_nunique | total_credits | loan_life_days_mean | max_credit | total_payments_sum | n_inquiries_l3m_mean | payment_amount | payment_to_due_ratio | time_to_loan_days_mean | credit_limit | n_bnpl_approved_apps_mean | outstanding_balance | n_inquiries_l6m_mean | n_bnpl_apps | max_days_late | last_payment_days | delayed_payments | current_balance | account_type_credito de habilitacion o avio_sum | account_type_hipoteca_sum | account_type_pagos fijos_sum | account_type_por determinar_sum | account_type_revolvente_sum | account_type_sin limite preestablecido_sum | currency_dolares_sum | currency_pesos mexicanos_sum | currency_unidades de inversion_sum | responsability_type_aval_sum | responsability_type_individual (titular)_sum | responsability_type_mancomunado_sum | responsability_type_obligatorio solidario_sum | responsability_type_titular con aval_sum | prevention_key_risk_score | gen_credit_type | gen_business_type | payment_frequency | Lifetime_Cluster | RFM_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1953 | 2269 | 0.0 | 538 | 242.0 | 5 | 121 | 121.812500 | 514592.0 | 754.0 | 0.0 | 196207.0 | 0.791488 | 6.0 | 286582.0 | 0.0 | 2498.965217 | 0.0 | 121.0 | 3267 | 14090.0 | 285 | 246840.0 | 0 | 0 | 73 | 37 | 7 | 4 | 0 | 121 | 0 | 0 | 82 | 1 | 0 | 38 | 1.685950 | 0.166163 | 0.164210 | 0.168463 | Short_Tenure | 142 |
| 2918 | 3525 | 0.0 | 243 | 19.0 | 2 | 19 | 40.166667 | 98732.0 | 241.0 | 0.0 | 32347.0 | 1.493158 | 4.0 | 23265.0 | 0.0 | 2709.368421 | 0.0 | 19.0 | 228 | 2043.0 | 178 | 33943.0 | 0 | 0 | 6 | 10 | 1 | 2 | 0 | 19 | 0 | 0 | 19 | 0 | 0 | 0 | 2.263158 | 0.166774 | 0.161613 | 0.168102 | Medium_Tenure | 122 |
| 6528 | 8768 | 1.0 | 143 | 1.0 | 1 | 1 | 252.000000 | 5213.0 | 16.0 | 0.0 | 3558.0 | 1.000000 | 30.0 | 0.0 | 0.0 | 3558.000000 | 8.0 | 1.0 | 42 | 252.0 | 9 | 3558.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 9.000000 | 0.166172 | 0.169220 | 0.170136 | Long_Tenure | 213 |
| 2842 | 3429 | 0.0 | 318 | 114.0 | 4 | 57 | 164.217391 | 765839.0 | 1426.0 | 0.0 | 139123.0 | 5.715088 | 23.0 | 474741.0 | 1.0 | 1334.272727 | 0.0 | 57.0 | 1596 | 5605.0 | 277 | 503648.0 | 0 | 2 | 53 | 0 | 1 | 1 | 0 | 57 | 0 | 0 | 55 | 0 | 2 | 0 | 2.561404 | 0.167614 | 0.166786 | 0.167638 | Medium_Tenure | 244 |
| 4817 | 5967 | 1.0 | 64 | 0.0 | 4 | 29 | 185.454545 | 103264.0 | 390.0 | 26.0 | 21002.0 | 0.922069 | 107.0 | 30888.0 | 0.0 | 1632.827586 | 26.0 | 29.0 | 1798 | 3987.0 | 34 | 23588.0 | 0 | 0 | 25 | 0 | 1 | 3 | 0 | 29 | 0 | 0 | 29 | 0 | 0 | 0 | 1.793103 | 0.167236 | 0.171364 | 0.167094 | Long_Tenure | 331 |
# --- Manual mappings to assign the values to each category
lifetime_cluster_mapping = {
'Short_Tenure': 1,
'Medium_Tenure': 2,
'Long_Tenure': 3,
}
# --- Map the lifetime cluster labels to their ordinal scores.
bankaya_model_merged_df['Lifetime_Cluster_score'] = bankaya_model_merged_df['Lifetime_Cluster'].map(lifetime_cluster_mapping)
# --- Drop unnecessary columns
bankaya_model_merged_df.drop("Lifetime_Cluster",inplace=True,axis=1)
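One caveat with `Series.map`: any cluster label missing from the manual mapping silently becomes `NaN`. A minimal sketch on toy data (not the project DataFrame) illustrating the behavior:

```python
import pandas as pd

# Toy Series containing one label absent from the mapping
clusters = pd.Series(['Short_Tenure', 'Long_Tenure', 'Unknown_Tenure'])
mapping = {'Short_Tenure': 1, 'Medium_Tenure': 2, 'Long_Tenure': 3}

scores = clusters.map(mapping)
print(scores.isna().sum())  # unmapped labels turn into NaN -> 1
```

Checking `isna()` after mapping (or using `fillna` with a default score) guards against silently dropping customers whose cluster label was misspelled or new.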
# <<<< Function to create the preprocessing + SMOTE pipeline for the model >>>>
# (uses the global `model_numerical_vars`, defined below)
def transform_data_test():
continuous_pipeline = Pipeline([
("imputer", SimpleImputer(strategy="mean")),
("scaler", RobustScaler(with_centering=True, with_scaling=True))
])
preprocessing_pipeline = ColumnTransformer([
("continous", continuous_pipeline, model_numerical_vars)
],
remainder="drop"
)
pipeline = ImbPipeline(steps=[
("preprocess", preprocessing_pipeline),
("smote", SMOTE(random_state=42))
])
return pipeline
# --- Identify all numerical variables in the dataset for the modeling preprocessing.
model_numerical_vars = [col for col in bankaya_model_merged_df.columns if 'target' not in col and 'customer_id' not in col]
# --- Pipeline process to be applied to the X_train and X_test
pipeline_data = transform_data_test()
pipeline_data
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('continous',
                                                  Pipeline(steps=[('imputer', SimpleImputer()),
                                                                  ('scaler', RobustScaler())]),
                                                  ['customer_lifetime', 'n_sf_apps',
                                                   'prevention_key_nunique', 'total_credits',
                                                   'loan_life_days_mean', 'max_credit',
                                                   'total_payments_sum', 'n_inquiries_l3m_mean',
                                                   'payment_amount', 'payment_to_due_ratio',
                                                   'time_to_loan_days_mean', 'credit_limit', ...])])),
                ('smote', SMOTE(random_state=42))])
# <<<< Hyperparameter search for each model >>>>
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
classifiers = {
'Bagging': BaggingClassifier(random_state=314),
'RF': RandomForestClassifier(random_state=314, n_jobs=-1),
'AdaBoost': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2, random_state=314), random_state=314),
'GBoosting': GradientBoostingClassifier(random_state=314),
'XGBoost': xgb.XGBClassifier(random_state=314),
'Logistic_r': LogisticRegression(random_state=314),
'D_tree': DecisionTreeClassifier(criterion='gini', random_state=314)
}
params = {
'RF':{'n_estimators':range(1,11),
'max_depth': range(1,11),
"min_samples_split":[2,4]
},
'AdaBoost': {'learning_rate':[1e-5, 1e-4,1e-3, 1e-2, 1e-1, 1.0],
'n_estimators':range(1,11)
},
'GBoosting': {'max_depth':range(1,11),
'n_estimators':range(1,11),
'learning_rate':[1e-3, 1e-2, 1e-1,1.0]
},
'XGBoost': {'max_depth':range(1,11),
'n_estimators':range(1,11),
"learning_rate":[1e-3, 1e-2, 1e-1,1.0],
'eval_metric': ['logloss']
},
'Logistic_r': {'penalty' : ['l1'],
'C': [0.01, 0.03,0.04,0.05,0.06],
'solver': ['saga'],
"max_iter":[5000,10000]
},
'D_tree': { 'max_depth': [2,4,5,6,7,8],
'min_samples_leaf': [1, 5, 10, 20, 35],
'min_samples_split': [2, 15, 30, 60],
}
}
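The candidate counts reported by the grid searches below follow directly from the Cartesian product of each grid. A quick sketch verifying the XGBoost grid size with sklearn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

xgb_grid = {
    'max_depth': range(1, 11),                  # 10 values
    'n_estimators': range(1, 11),               # 10 values
    'learning_rate': [1e-3, 1e-2, 1e-1, 1.0],   # 4 values
    'eval_metric': ['logloss'],                 # 1 value
}
# 10 * 10 * 4 * 1 = 400 candidates; with cv=3 that means 1200 fits
print(len(ParameterGrid(xgb_grid)))  # 400
```

Counting candidates up front is a cheap way to estimate search cost before launching a long-running `GridSearchCV`.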
#--- Split DataFrame into train and test
y_bankaya_model = bankaya_model_merged_df["target"]
X_bankaya_model = bankaya_model_merged_df.drop(["target","customer_id"],axis=1) # --- Remove unnecessary columns
# --- Split model into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X_bankaya_model,y_bankaya_model, test_size=0.2, random_state=42, stratify=y_bankaya_model)
# Replace infinite values with NaN
X_train.replace([np.inf, -np.inf], np.nan, inplace=True)
X_test.replace([np.inf, -np.inf], np.nan, inplace=True)
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
print("\n======== Target variable counts in the Train dataset ===========\n")
print("Before OverSampling, counts of target '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of target '0': {} \n".format(sum(y_train==0)))
Number transactions X_train dataset:  (7399, 40)
Number transactions y_train dataset:  (7399,)
Number transactions X_test dataset:  (1850, 40)
Number transactions y_test dataset:  (1850,)

======== Target variable counts in the Train dataset ===========

Before OverSampling, counts of target '1': 1380
Before OverSampling, counts of target '0': 6019
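The `stratify=y_bankaya_model` argument is what keeps the class split proportional across train and test. A toy check of this behavior (synthetic labels, not the project data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80/20 imbalanced toy labels
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Both splits preserve the 20% minority rate
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```

Without `stratify`, a small or unlucky split could leave the test set with too few minority-class customers to evaluate recall reliably.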
# --- Use fit_resample method just in train to apply the resample SMOTE method
X_train_resampled, y_train_resampled = pipeline_data.fit_resample(X_train, y_train)
# First, get the preprocessing and transformation steps without the SMOTE step
preprocess_pipeline = pipeline_data[:-1]
# --- Apply the transform method to the test set using the obtained preprocessing pipeline
X_test_scaled = preprocess_pipeline.transform(X_test) # apply transform just in test
# Convert the scaled and transformed X_train back to a DataFrame
X_train_resampled_df = pd.DataFrame(X_train_resampled, columns=model_numerical_vars)
# Convert the scaled and transformed X_test back to a DataFrame
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=model_numerical_vars)
print('After SMOTE, the shape of X_train_resampled: {}'.format(X_train_resampled.shape))
print('After SMOTE, the shape of y_train_resampled: {} \n'.format(y_train_resampled.shape))
print("After SMOTE, counts of label '1': {}".format(sum(y_train_resampled==1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_resampled==0)))
After SMOTE, the shape of X_train_resampled: (12038, 40)
After SMOTE, the shape of y_train_resampled: (12038,)

After SMOTE, counts of label '1': 6019
After SMOTE, counts of label '0': 6019
import warnings
warnings.filterwarnings("ignore")
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables HalvingGridSearchCV)
from sklearn.model_selection import HalvingGridSearchCV
# Create an empty dictionary where each fitted grid search will be stored
clasifier_dict = {}
# < ===== GridSearch for the XGBoost model ====== >
for name in list(params.keys()):
if name == "XGBoost":
print("================>")
print(name)
print("================>")
clasifier_dict[name] = GridSearchCV(estimator=classifiers[name],
param_grid=params[name],
scoring='roc_auc',
return_train_score=True,
n_jobs=-1, cv=3, verbose=1)
clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
XGBoost
================>
Fitting 3 folds for each of 400 candidates, totalling 1200 fits
# < ======= GridSearch for the Logistic Regression model ========= >
for name in list(params.keys()):
if name == "Logistic_r":
print("================>")
print(name)
print("================>")
clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
param_grid=params[name],
scoring="roc_auc",
return_train_score=True,
n_jobs=-1, cv=3, verbose=1)
clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
Logistic_r
================>
Fitting 3 folds for each of 10 candidates, totalling 30 fits
# < ======= GridSearch for the GBoosting model ========= >
for name in list(params.keys()):
if name == "GBoosting":
print("================>")
print(name)
print("================>")
clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
param_grid=params[name],
scoring="roc_auc",
return_train_score=True,
n_jobs=-1, cv=3, verbose=1)
clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
GBoosting
================>
Fitting 3 folds for each of 400 candidates, totalling 1200 fits
# < ======= GridSearch for the AdaBoost model ========= >
for name in list(params.keys()):
if name == "AdaBoost":
print("================>")
print(name)
print("================>")
clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
param_grid=params[name],
scoring='roc_auc',
return_train_score=True,
n_jobs=-1, cv=3, verbose=1)
clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
# < ======= GridSearch for the Random Forest model ========= >
for name in list(params.keys()):
if name == "RF":
print("================>")
print(name)
print("================>")
clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
param_grid=params[name],
scoring='roc_auc',
return_train_score=True,
n_jobs=-1, cv=3, verbose=1)
clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
RF
================>
Fitting 3 folds for each of 200 candidates, totalling 600 fits
# < ======= GridSearch for the Decision Tree model ========= >
for name in list(params.keys()):
if name == "D_tree":
print("================>")
print(name)
print("================>")
clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
param_grid=params[name],
scoring='roc_auc',
return_train_score=True,
n_jobs=-1, cv=3, verbose=1)
clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
D_tree
================>
Fitting 3 folds for each of 120 candidates, totalling 360 fits
clasifier_dict
{'XGBoost': GridSearchCV(cv=3,
estimator=XGBClassifier(base_score=None, booster=None,
callbacks=None, colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, device=None,
early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
feature_types=None, gamma=None,
grow_policy=None, importance_type=None,
interaction_constraints=None,
learning_rate=None,...
max_leaves=None, min_child_weight=None,
missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None,
n_jobs=None, num_parallel_tree=None,
random_state=314, ...),
n_jobs=-1,
param_grid={'eval_metric': ['logloss'],
'learning_rate': [0.001, 0.01, 0.1, 1.0],
'max_depth': range(1, 11),
'n_estimators': range(1, 11)},
return_train_score=True, scoring='roc_auc', verbose=1),
'Logistic_r': GridSearchCV(cv=3, estimator=LogisticRegression(random_state=314), n_jobs=-1,
param_grid={'C': [0.01, 0.03, 0.04, 0.05, 0.06],
'max_iter': [5000, 10000], 'penalty': ['l1'],
'solver': ['saga']},
return_train_score=True, scoring='roc_auc', verbose=1),
'GBoosting': GridSearchCV(cv=3, estimator=GradientBoostingClassifier(random_state=314),
n_jobs=-1,
param_grid={'learning_rate': [0.001, 0.01, 0.1, 1.0],
'max_depth': range(1, 11),
'n_estimators': range(1, 11)},
return_train_score=True, scoring='roc_auc', verbose=1),
'AdaBoost': GridSearchCV(cv=3,
estimator=AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=314),
random_state=314),
n_jobs=-1,
param_grid={'learning_rate': [1e-05, 0.0001, 0.001, 0.01, 0.1,
1.0],
'n_estimators': range(1, 11)},
return_train_score=True, scoring='roc_auc', verbose=1),
'RF': GridSearchCV(cv=3,
estimator=RandomForestClassifier(n_jobs=-1, random_state=314),
n_jobs=-1,
param_grid={'max_depth': range(1, 11), 'min_samples_split': [2, 4],
'n_estimators': range(1, 11)},
return_train_score=True, scoring='roc_auc', verbose=1),
'D_tree': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=314),
n_jobs=-1,
param_grid={'max_depth': [2, 4, 5, 6, 7, 8],
'min_samples_leaf': [1, 5, 10, 20, 35],
'min_samples_split': [2, 15, 30, 60]},
return_train_score=True, scoring='roc_auc', verbose=1)}
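Each fitted `GridSearchCV` also exposes `best_params_`, `best_score_`, and `best_estimator_`, which is how the winning configuration is extracted later. A small self-contained sketch on synthetic data (toy grid, not the searches above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=314)
gs = GridSearchCV(DecisionTreeClassifier(random_state=314),
                  {'max_depth': [2, 4, 8]},
                  scoring='roc_auc', cv=3)
gs.fit(X, y)

print(sorted(gs.best_params_))                    # ['max_depth']
print(gs.best_estimator_.max_depth in [2, 4, 8])  # True
```

`best_estimator_` is the estimator refit on the full training data with `best_params_`, so it can be used directly for prediction or feature importances.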
# ======= Results of the Model ==============
test_scores = []
for name in list(params.keys()):
try:
print(name)
print("===========>")
print("Number of configurations: {}\n".format(pd.DataFrame.from_dict(clasifier_dict[name].cv_results_).shape[0]))
aux = pd.DataFrame.from_dict(clasifier_dict[name].cv_results_).sort_values(by="mean_test_score",ascending=False).iloc[0]
roc_cv_test = aux['mean_test_score']
roc_cv_train = aux['mean_train_score']
roc_train = roc_auc_score(y_train_resampled, clasifier_dict[name].predict(X_train_resampled))
roc_test = roc_auc_score(y_test, clasifier_dict[name].predict(X_test_scaled))
test_scores.append((name,roc_train,roc_cv_train,roc_cv_test,roc_test))
except KeyError:
print("Model {} was not fitted / stored in clasifier_dict\n".format(name))
RF
===========>
Number of configurations: 200

AdaBoost
===========>
Number of configurations: 60

GBoosting
===========>
Number of configurations: 400

XGBoost
===========>
Number of configurations: 400

Logistic_r
===========>
Number of configurations: 10

D_tree
===========>
Number of configurations: 120
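Note that the AUC_train and AUC_test values above are computed from hard `predict` labels, which collapses the ROC curve to a single threshold; scoring with `predict_proba` probabilities generally gives a more faithful AUC. A toy illustration of the difference:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.4, 0.9])   # predicted scores for class 1
labels = (proba >= 0.5).astype(int)      # hard labels at a 0.5 threshold

auc_proba = roc_auc_score(y_true, proba)
auc_labels = roc_auc_score(y_true, labels)
print(auc_proba)   # 0.75
print(auc_labels)  # 0.5
```

Thresholding discards the ranking information the AUC is meant to measure, so `clasifier_dict[name].predict_proba(X)[:, 1]` would be the usual input to `roc_auc_score`.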
#Performance of the Models
resultado = pd.DataFrame(test_scores,columns=['Classifier','AUC_train','AUC_CV_train','AUC_validation','AUC_test'])
print("\n=========== Performance Table ============")
display(resultado)
best_clas = resultado.sort_values(by="AUC_validation",ascending=False).iloc[0]['Classifier']
print("The best classifier is: {}".format(best_clas))
=========== Performance Table ============
| | Classifier | AUC_train | AUC_CV_train | AUC_validation | AUC_test |
|---|---|---|---|---|---|
| 0 | RF | 0.982555 | 0.999646 | 0.995155 | 0.940994 |
| 1 | AdaBoost | 0.987539 | 0.999339 | 0.998176 | 0.979315 |
| 2 | GBoosting | 1.000000 | 1.000000 | 0.999153 | 0.994111 |
| 3 | XGBoost | 1.000000 | 1.000000 | 0.999783 | 0.998671 |
| 4 | Logistic_r | 0.963449 | 0.990104 | 0.989697 | 0.951683 |
| 5 | D_tree | 0.990281 | 0.999219 | 0.994379 | 0.975179 |
The best classifier is: XGBoost
AUC (Area Under the ROC Curve): a score of 1 indicates perfect separation of good and risky customers, while 0.5 is equivalent to random guessing. The table reports each classifier's AUC on the resampled training set (AUC_train), on the cross-validation training and validation folds (AUC_CV_train, AUC_validation), and on the held-out test set (AUC_test).
Insights:
Overfitting Check: Comparing AUC_train with AUC_test helps identify potential overfitting: if AUC_train is very high but AUC_test is markedly lower, the model is likely memorizing the training data. XGBoost posts perfect training and cross-validation scores, which warrants caution; note also that SMOTE was applied before the grid-search CV split, so the validation folds contain synthetic points derived from their own training folds, and the CV scores are likely optimistic. Its high AUC_test on the untouched hold-out set nonetheless indicates strong performance on unseen data.
Cross-Validation Consistency: AUC_CV_train provides a measure of the model's performance on different subsets of the training data. If AUC_CV_train is close to AUC_train, it suggests the model is stable across different subsets of the training data.
# --- Get the best model
best_model = clasifier_dict["XGBoost"].best_estimator_
# --- Get feature importances from the best model
feature_importances = best_model.feature_importances_
# >>>> Confusion Matrix XGBoost <<<<
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
y_target = y_test
y_predicted = clasifier_dict["XGBoost"].predict(X_test_scaled)
cm = confusion_matrix(y_target,
y_predicted)
fig, ax = plot_confusion_matrix(conf_mat=cm)
plt.show()
precision = precision_score(y_target, y_predicted)
recall = recall_score(y_target, y_predicted)
F1=f1_score(y_target, y_predicted)
print('Precision: {:.2f} \nRecall {:.2f}\nF1 score {:.2f}'.format(precision *100, recall * 100,F1 * 100 ))
Precision: 98.85
Recall 100.00
F1 score 99.42
High Precision: The model ensures that almost all of the customers it predicts as risky (predicted label 1) are actually risky (true label 1); only a very small fraction of good customers are incorrectly labeled as risky. This will help the bank avoid giving loans to potentially risky customers while rarely turning away good ones.
Perfect Recall: A recall of 100% means that the model identifies all the risky customers (true label 1). No risky customer will go unnoticed, which is crucial for Bankaya as it reduces the potential for bad loans.
High F1 Score: An F1 score of 99.42% is an indication that there's a good balance between precision and recall. This further solidifies the model's reliability in differentiating between good and risky customers.
Minimal False Positives: The model only wrongly predicts 4 good customers as risky out of 1505 good customers. This is a very low number and indicates that the chance of a good customer being rejected is minimal.
No False Negatives: The fact that there are no false negatives (0 customers wrongly predicted as good when they are risky) is excellent for Bankaya. It means they won't be approving loans to customers who are likely to default.
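The precision and recall reported above can be verified directly from the confusion-matrix counts using their definitions. A toy sketch (synthetic labels, not the Bankaya test set):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1])  # one false positive, no false negatives

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # of predicted risky, how many are truly risky
recall = tp / (tp + fn)     # of truly risky, how many were caught

print(precision, recall)  # 0.8 1.0
```

With zero false negatives, recall is exactly 1.0 regardless of precision, which mirrors the pattern seen in the XGBoost results.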
# --- Build a feature-importance table for the best XGBoost model
df_importance = pd.DataFrame({
'Feature': model_numerical_vars,
'Importance': feature_importances
})
df_importance = df_importance.sort_values('Importance', ascending=True)
fig = px.bar(df_importance, x='Importance', y='Feature', orientation='h',
title='Feature Importance XGBoost', labels={'Importance': 'Importance Score'})
# Set the figure size
fig.update_layout(
autosize=False,
width=1000, # You can adjust the width to your preference
height=800, # You can adjust the height to make sure all labels are visible
margin=dict(
l=50, # left margin, increase if your y labels are getting cut off
r=50, # right margin
b=100, # bottom margin
t=100, # top margin
)
)
# You can also modify the font size of the labels if necessary
fig.update_layout(
title_font_size=20,
font=dict(
family="Arial, Balto, Courier New, Droid Sans", # Choose a font family
size=12, # Set the global font size
color="RebeccaPurple" # Set the font color
)
)
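The same `feature_importances_` array can also be queried programmatically for the top-k drivers, e.g. with `np.argsort`. A sketch with toy importance values (hypothetical numbers, not the fitted model's):

```python
import numpy as np

features = np.array(['max_days_late', 'payment_amount', 'credit_limit', 'n_bnpl_apps'])
importances = np.array([0.40, 0.10, 0.35, 0.15])  # toy values for illustration

# Indices of the two largest importance scores, highest first
top2_idx = np.argsort(importances)[::-1][:2]
top_features = [str(f) for f in features[top2_idx]]
print(top_features)  # ['max_days_late', 'credit_limit']
```

This is handy for feeding the strongest predictors into a credit-policy review without reading them off the chart.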
Conclusions:
Credit Approval Criteria:
Past Behavior is Crucial: Features related to past payment behavior, like max_days_late and last_payment_days, are vital. Bankaya should closely evaluate the payment history of applicants.
Diverse Financial Footprint: The variety of features from account types to credit types indicates that customers with a diverse financial footprint (various accounts, credits) provide more data points to assess their reliability.
Application History: It's essential to consider how often a customer has applied for loans and how many have been approved. A high number of rejected applications might be a red flag.
Recommendations:
Risk Mitigation:
Implement a tiered credit approval process where customers with indicators of higher risk (such as a high max days late, a high number of inquiries, or high current balances) are subjected to more rigorous scrutiny or are offered products with different terms (e.g., lower credit limits or higher interest rates to offset risk).
Develop targeted financial education programs for new customers, especially in clusters with shorter tenures, to encourage financial responsibility.
Customer Engagement:
Design engagement and retention campaigns around the 6-7 month mark, the average time customers take from account creation to loan application. Develop loyalty programs and premium services for long-tenure customers to capitalize on their demonstrated loyalty.